Introduction to XArray
This tutorial introduces XArray, a Python library for working with labeled multidimensional arrays.
DEA uses XArray as its data model. To better understand what it is, let's first do a simple experiment on how we could pack remote sensing data using a combination of plain numpy arrays and Python dictionaries.
Suposse we have a satellite image with three bands: Red, NIR and SWIR. These bands are represented as 2-dimensional numpy arrays. We could also store the latitude and longitude coordinates for each dimension using 1-dimensional arrays. Finally, we could also store some metadata to help describe our images.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from check_answer import check_answer
red = np.random.rand(250,250)
nir = np.random.rand(250,250)
swir = np.random.rand(250,250)
lats = np.linspace(-23.5, -26.0, num=red.shape[0], endpoint=False)
lons = np.linspace(110.0, 112.5, num=red.shape[1], endpoint=False)
title = "Image of the desert"
date = "2019-11-10"
image = {"red": red,
"nir": nir,
"swir": swir,
"latitude": lats,
"longitude": lons,
"title": title,
"date": date}
image["date"], image["latitude"][:4]
plt.imshow(image['nir'])
image["nir"].mean()
Still, the variables inside our dictionary are independent and we don't know how they are linked. For example, we have the variable latitude
but we don't know to what axis in the image arrays it refers. We also need to use positional indices to select parts data in the numpy arrays containing the image data. Wouldn't it be convenient to be able to select data from the images using the coordinates of the pixels instead of their relative positions?
This is exactly what XArray solves! Let's see how it works:
import xarray as xr
from datetime import datetime
To explore XArray we have a file containing some reflectance data of Canberra that has been generated using the DEA library.
The object that we get ds
is a XArray Dataset
, which in some ways is very similar to the dictionary that we created before, but with lots of convenient functionality available.
ds = xr.open_dataset('data/canberra_ls8.nc')
ds
A Dataset
can be seen as a dictionary structure packing up the data, dimensions and attributes all linked together.
Variables in a Dataset
object are called DataArrays
and they share dimensions with the higher level Dataset
So far, we have been using 3-dimensional numpy arrays in which the third dimension represented the bands of images and remote sensing data. Numpy can store data in up to 32 dimensions so we could for example use 4-dimensional arrays to store multispectral images with a temporal dimensions, to perform time series analysis.
To facilitate working with these data, DEA follows the convention of storing spectral bands as separate variables storing each one as 3-dimensional cubes containing the temporal dimension.
To access a variable we can access as if it were a Python dictionary, or using the .
notation, which is more convenient.
ds["green"]
#or alternatively
ds.green
ds['time']
#or alternatively
ds.time
ds.attrs['Conventions']
#or alternatively
ds.Conventions
answ = ?
check_answer("5.1", answ)
arr = ds.green.values
type(arr), arr.shape
answ = ?
check_answer("5.2", int(answ[0]*1e6))
Selecting data and subsetting numpy arrays is done using positional indices to specify positions or ranges of values along the different axis of an array. When we use the [:,:]
notation, we need to know beforehand what is the relative position of each axis in our arrays.
XArray provides an abstraction in which we can refer to each axis by its name. Also we can select subsets of the data arrays using two modes or methods:
isel()
: For selecting data based on its index (like numpy).sel()
: For selecting data based on its dimension of label value.
For example, for selecting the first element in the temporal dimension of the green
variable we do:
print("Initial time dimension values:", ds.green.time.values)
ss = ds.green.isel(time=0)
ss
ss = ds.green.sel(time=datetime(2016,1,1))
ss
ss = ds.green.sel(time=datetime(2016,1,1), latitude=slice(-35.30,-35.24))
ss
answ = ds.?
check_answer("5.3", answ.shape)
ss = ds.green.sel(time=datetime(2016,1,1), latitude=slice(-35.30,-35.24))
type(ss), type(ds.green)
answ = ?
plt.imshow(answ)
check_answer("5.4", int(answ[0,0])),
print("Mean of green band:", ds.green.mean())
print("Standard deviation of green band:", ds.green.std())
print("Sum of green band:", ds.green.sum())
answ = ?
check_answer("5.5", int(answ.values))
ds["green"].isel(time=0).plot()
rgb = np.dstack((ds.red.isel(time=0).values, ds.green.isel(time=0).values, ds.blue.isel(time=0).values))
rgb = np.clip(rgb, 0, 2000) / 2000
plt.imshow(rgb)
# Selection of the bands | time sel | numpy conv| plot (params for plotting function)
ds[['red', 'green', 'blue']].isel(time=0).to_array().plot.imshow(robust=True, figsize=(6, 6))
answ = ?
answ.to_array().plot.imshow(robust=True, figsize=(6, 6))
check_answer("5.6", answ.to_array().values.shape)
answ = ?
answ.isel(time=0).plot(figsize=(6, 6), cmap='summer_r')
check_answer("5.7", int(answ.values[0,100,100]*1000))