DEA uses XArray as its data model. To better understand what it is, let's first do a simple experiment on how we could pack remote sensing data using a combination of plain numpy arrays and Python dictionaries.

Suposse we have a satellite image with three bands: Red, NIR and SWIR. These bands are represented as 2-dimensional numpy arrays. We could also store the latitude and longitude coordinates for each dimension using 1-dimensional arrays. Finally, we could also store some metadata to help describe our images.

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from check_answer import check_answer

red = np.random.rand(250,250)
nir = np.random.rand(250,250)
swir = np.random.rand(250,250)

lats = np.linspace(-23.5, -26.0, num=red.shape[0], endpoint=False)
lons = np.linspace(110.0, 112.5, num=red.shape[1], endpoint=False)

title = "Image of the desert"
date = "2019-11-10"

image = {"red": red,
         "nir": nir,
         "swir": swir,
         "latitude": lats,
         "longitude": lons,
         "title": title,
         "date": date}

All our data is conveniently packed in a dictionary. Now we can use this dictionary to work with it:

image["date"], image["latitude"][:4]

We can address any variable inside this image dictionary and work directly with other functions. For example, to plot the nir band and calculate its mean:

plt.imshow(image['nir'])

image["nir"].mean()

Still, the variables inside our dictionary are independent and we don't know how they are linked. For example, we have the variable latitude but we don't know to what axis in the image arrays it refers. We also need to use positional indices to select parts data in the numpy arrays containing the image data. Wouldn't it be convenient to be able to select data from the images using the coordinates of the pixels instead of their relative positions?

This is exactly what XArray solves! Let's see how it works:

import xarray as xr
from datetime import datetime

To explore XArray we have a file containing some reflectance data of Canberra that has been generated using the DEA library.

The object that we get ds is a XArray Dataset, which in some ways is very similar to the dictionary that we created before, but with lots of convenient functionality available.

ds = xr.open_dataset('data/canberra_ls8.nc')

ds

A Dataset can be seen as a dictionary structure packing up the data, dimensions and attributes all linked together.

Variables in a Dataset object are called DataArrays and they share dimensions with the higher level Dataset

So far, we have been using 3-dimensional numpy arrays in which the third dimension represented the bands of images and remote sensing data. Numpy can store data in up to 32 dimensions so we could for example use 4-dimensional arrays to store multispectral images with a temporal dimensions, to perform time series analysis.

To facilitate working with these data, DEA follows the convention of storing spectral bands as separate variables storing each one as 3-dimensional cubes containing the temporal dimension.

To access a variable we can access as if it were a Python dictionary, or using the . notation, which is more convenient.

ds["green"]

#or alternatively 

ds.green

Dimensions are also stored as numerical arrays with the same size as the image's axis they are referring.

ds['time']

#or alternatively 

ds.time

Metadata is referred as Attributes and is internally stored under .attrs, but the same convenient . notation applies to them.

ds.attrs['Conventions']

#or alternatively 

ds.Conventions

Exercise 5.1: Can you access to the geospatial_bounds_crs value in the attributes of this XArray Dataset?

answ = ?

check_answer("5.1", answ)

DataArrays store their data internally as multidimensional numpy arrays. But these arrays contain dimensions or labels that make it easier handle the data. To access the underlaying numpy array of a DataArray we can use the .values notation.

arr = ds.green.values

type(arr), arr.shape

Exercise 5.2: Can you store in the answ variable the underlying numpy array containing the longitude dimension in this Dataset?

answ = ?

check_answer("5.2", int(answ[0]*1e6))

Selecting data and subsetting numpy arrays is done using positional indices to specify positions or ranges of values along the different axis of an array. When we use the [:,:] notation, we need to know beforehand what is the relative position of each axis in our arrays.

XArray provides an abstraction in which we can refer to each axis by its name. Also we can select subsets of the data arrays using two modes or methods:

  • isel(): For selecting data based on its index (like numpy).
  • sel(): For selecting data based on its dimension of label value.

For example, for selecting the first element in the temporal dimension of the green variable we do:

print("Initial time dimension values:", ds.green.time.values)

ss = ds.green.isel(time=0)

ss

On the other hand we can use the .sel() method to select parts of the array by their label or content. See that in this case we do not refer to the data by its positional index but by its dimensional value.

ss = ds.green.sel(time=datetime(2016,1,1))

ss

Both methods sel() and isel() can receive as many arguments as dimensions have the data array. We can use any order in to pass the dimensions and we can also define slices or ranges of values using the slice() notation. For example:

ss = ds.green.sel(time=datetime(2016,1,1), latitude=slice(-35.30,-35.24))

ss

Exercise 5.3: Can you select the region of the red variable delimited by these coordinates:

  • latitude [-35.30,-35.29]
  • longitude [149.11,149.13]
answ = ds.?

check_answer("5.3", answ.shape)

When we use the selection methods on Datasets and DataArrays we get an object of the same type.

ss = ds.green.sel(time=datetime(2016,1,1), latitude=slice(-35.30,-35.24))

type(ss), type(ds.green)

Exercise 5.4: Use the imshow function to create an image of the first time of the red channel in the dataset.

Tip: Use the .values method to convert the DataArray object into a numpy array, so matplotlib can work with it.

answ = ?

plt.imshow(answ)

check_answer("5.4", int(answ[0,0])),

Xarray exposes lots of functions to perform analisis on Datasets and DataArrays with a similar syntax to numpy's. For example to calculate the spatial mean of the green band

print("Mean of green band:", ds.green.mean())
print("Standard deviation of green band:", ds.green.std())
print("Sum of green band:", ds.green.sum())

Exercise 5.5: Can you find the difference between the means of the red and nir channels?

answ = ?

check_answer("5.5", int(answ.values))

Plotting is also conveniently integrated as a method on DataArrays.

Note: For plotting you need to pass a 2-dimensional DataArray object, so normally a temporal element needs to be selected.

ds["green"].isel(time=0).plot()

We still can do things manually using numpy and matplotlib

rgb = np.dstack((ds.red.isel(time=0).values, ds.green.isel(time=0).values, ds.blue.isel(time=0).values))
rgb = np.clip(rgb, 0, 2000) / 2000

plt.imshow(rgb)

The previous image is upside down, so we'd still need to flip the image vertically in numpy to represent it correctly. This has to do with how numerical arrays are stored in netCDF files.

But compare to these chained operations within XArray (Well see more simple ways of doing this in DEA though)

#   Selection of the bands  | time sel  | numpy conv|    plot   (params for plotting function)

ds[['red', 'green', 'blue']].isel(time=0).to_array().plot.imshow(robust=True, figsize=(6, 6))

Exercise 5.6: Similarly to the previous image, create an RGB image using the .sel() functionality select the subset defined by the following dimension values:

  • time -> 2017-01-01
  • latitude -> [-35.29, -35.27]
  • longitude -> [149.1, 149.13]
answ = ?

answ.to_array().plot.imshow(robust=True, figsize=(6, 6))

check_answer("5.6", answ.to_array().values.shape)

Exercise 5.7: Can you create an NDVI representation of the whole extend in ds?

answ = ?

answ.isel(time=0).plot(figsize=(6, 6), cmap='summer_r')

check_answer("5.7", int(answ.values[0,100,100]*1000))