Categorical and missing data
This tutorial explores further NumPy concepts such as categorical data, advanced indexing and dealing with Not-a-Number (NaN) data.
Before we start, let's take a quick look at a Python data structure called a dictionary. This will help us understand some of the material in this tutorial and will also help to introduce xarray later on.
A dictionary represents a mapping between keys and values. Values can be Python objects of any type; keys must be hashable (for example numbers, strings or tuples). We declare a dictionary using curly braces. Inside, we use a colon (:) to separate each key from its value, and commas to separate the elements of the dictionary. For example:
d = {1: 'one',
2: 'two',
3: 'tree'}
print(d[1], " + ", d[2], " = ", d[3])
d[3] = 'three'
d[4] = 'four'
print(d[1], " + ", d[2], " = ", d[3])
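Later cells call index.values() to build a colour map, so it helps to see how a dictionary exposes its keys and values. The sketch below uses a hypothetical dictionary named land_cover (not part of the tutorial data) and assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
# Hypothetical example dictionary; the name and contents are for illustration only
land_cover = {1: 'grass', 2: 'crops', 3: 'city'}

# Keys and values come back in insertion order (guaranteed from Python 3.7)
keys = list(land_cover.keys())
values = list(land_cover.values())
print(keys)
print(values)

# Membership tests look at the keys, not the values
print(2 in land_cover)
print('crops' in land_cover)

# items() yields (key, value) pairs, which is handy in loops
for code, name in land_cover.items():
    print(code, "->", name)

# get() returns a default instead of raising KeyError for a missing key
missing = land_cover.get(4, 'unknown')
print(missing)
```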
%matplotlib inline
import numpy as np
import imageio
from matplotlib import pyplot as plt
from matplotlib import colors
from check_answer import check_answer
Categorical data: sometimes remote sensing is used to create classification products. These products do not contain continuous values. They use discrete values to represent the different classes individual pixels can belong to.
As an example, the following cell simulates a very simple image containing three different land cover types. The value 1 represents areas covered with grass, 2 croplands and 3 city.
# grass = 1
area = np.ones((100,100))
# crops = 2
area[10:60,20:50] = 2
# city = 3
area[70:90,60:80] = 3
area.shape, area.dtype, np.unique(area)
# We map the values to colours
index = {1: 'green', 2: 'yellow', 3: 'grey'}
# Create a discrete colour map
cmap = colors.ListedColormap(index.values())
# Plot
plt.imshow(area, cmap=cmap)
area?
index? = ?
# Regenerate discrete colour map
cmap = colors.ListedColormap(index.values())
# Plot
plt.imshow(area, cmap=cmap)
check_answer("4.1.1", area[20,30]), check_answer("4.1.2", index[4])
im = imageio.imread('data/land_mask.png')
plt.imshow(im)
In remote sensing analysis it's common to be interested in certain features of the Earth's surface, such as vegetation. Clouds, cloud shadows and even water bodies normally need to be removed or 'masked' before the data can be processed.
For this example, we have three .npy files containing NumPy arrays, which represent the masks used to filter clouds, shadows and water from our image.
import matplotlib.gridspec as gridspec
plt.figure(figsize=(12,8))
gs = gridspec.GridSpec(1,3) # set up a 1 x 3 grid of images
ax1=plt.subplot(gs[0,0])
water_mask = np.load("data/water_mask.npy")
plt.imshow(water_mask)
ax1.set_title('Water Mask')
ax2=plt.subplot(gs[0,1])
cloud_mask = np.load("data/cloud_mask.npy")
plt.imshow(cloud_mask)
ax2.set_title('Cloud Mask')
ax3=plt.subplot(gs[0,2])
shadow_mask = np.load("data/shadow_mask.npy")
plt.imshow(shadow_mask)
ax3.set_title('Shadow Mask')
plt.show()
These masks are stored as dtype=uint8, using 1 to indicate presence and 0 for absence of each feature.
Exercise 4.2: Can you use the water mask to set all the pixels in the image array representing water to 0?
Tip: Remember that boolean arrays can be used to index and select regions of another array. To complete this exercise you will need to convert the previous water mask array into boolean types before you can use it.
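As a small illustration of the tip, the sketch below applies a 0/1 mask to a toy array. The arrays values and mask are invented for this example and are unrelated to the exercise data.

```python
import numpy as np

# Toy data, invented for illustration
values = np.array([[10, 20, 30],
                   [40, 50, 60]], dtype=np.uint8)
mask = np.array([[0, 1, 0],
                 [1, 0, 0]], dtype=np.uint8)

# A uint8 array used directly as an index is treated as integer (fancy) indexing,
# so convert it to booleans first
bool_mask = mask.astype(bool)

# Select the flagged elements
selected = values[bool_mask]
print(selected)

# Assign to the flagged elements in place
values[bool_mask] = 0
print(values)
```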
# 1.- Load the image
answ = imageio.imread('data/land_mask.png')
# 2.- Create a boolean version of the water_mask array
bool_water_mask = ?
# 3.- Use the previous boolean array to set all water pixels in the answ array to 0
answ[?] = ?
# You should see the water region in white
plt.imshow(answ)
check_answer("4.2", answ[200,200])
# 1.- Load the image
answ = imageio.imread('data/land_mask.png')
# 2.- Create boolean versions of the masks
bool_water_mask = ?
bool_cloud_mask = ?
bool_shadow_mask = ?
# 3.- Use the previous boolean arrays to set all water, cloud and shadow pixels in the answ array to 0 (you might need more than one line)
answ[?] = ?
# You should see just green and all the other regions white
plt.imshow(answ)
check_answer("4.3", answ[200,200]+answ[100,100]+answ[100,180]+answ[0,0])
mask = water_mask*1 + cloud_mask*2 + shadow_mask*3
plt.imshow(mask)
However, this way of representing categories is not very convenient when pixels can belong to two or more categories at the same time. For example, if a pixel is classified as cloud shadow and water at the same time, we would need to come up with a new category to represent this case.
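The problem can be seen with a tiny example. The one-dimensional arrays here are invented stand-ins for the real masks: a pixel flagged as both water and cloud receives the value 3, which cannot be told apart from a shadow-only pixel.

```python
import numpy as np

# Invented 0/1 masks for four pixels (not the tutorial's data)
water = np.array([1, 0, 0, 1], dtype=np.uint8)
cloud = np.array([0, 1, 0, 1], dtype=np.uint8)
shadow = np.array([0, 0, 1, 0], dtype=np.uint8)

# Same weighting as the combined mask: water*1 + cloud*2 + shadow*3
combined = water * 1 + cloud * 2 + shadow * 3
print(combined)

# Pixel 2 is shadow only, pixel 3 is water and cloud: both end up as 3
collision = bool(combined[2] == combined[3])
print(collision)
```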
Instead, it's common practice to use bit flags to create these masking or pixel quality products. Bit flags use the binary representation of a number (using 0s and 1s) to encode the different categories. For example, a uint8 number can store values in the range [0, 255] and is internally represented with 8 bits, each of which can be either 0 or 1.
In our previous case we could have used the following encoding:
- Bit 0: Water 00000001 -> 1
- Bit 1: Cloud 00000010 -> 2
- Bit 2: Shadow 00000100 -> 4
So, if a pixel is classified as both shadow and water, it would be encoded by the value 5: 00000101 -> 5
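A short sketch of how such flags can be combined and tested with NumPy's bitwise operators. The constant names WATER, CLOUD and SHADOW are assumptions made for this example; the bit positions follow the encoding listed above.

```python
import numpy as np

# Hypothetical flag constants following the encoding in the text
WATER = 1 << 0   # 00000001 -> 1
CLOUD = 1 << 1   # 00000010 -> 2
SHADOW = 1 << 2  # 00000100 -> 4

# Four invented pixels: clear, water, shadow, water + shadow
flags = np.array([0, WATER, SHADOW, WATER | SHADOW], dtype=np.uint8)
print(flags)

# A bitwise AND with a flag is non-zero wherever that bit is set
is_water = (flags & WATER) != 0
is_shadow = (flags & SHADOW) != 0
print(is_water)
print(is_shadow)

# Conditions combine like any other boolean arrays
water_and_shadow = is_water & is_shadow
print(water_and_shadow)
```

Because each category owns its own bit, any combination of categories maps to a distinct value, which is what the additive 1/2/3 encoding could not guarantee.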
Exercise 4.4: How would you represent a pixel that is a cloud and a shadow at the same time?
answ = ?
# Print binary format of answ
print(f"{answ:08b}")
check_answer("4.4", answ)
pq = imageio.imread('data/LC08_L1TP_112084_20190820_20190902_01_T1_BQA.tiff')
plt.imshow(pq)
pq.shape, pq.dtype, np.unique(pq)
"{:016b}".format(2720)
print("{:016b}".format(2976))
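The two values formatted above can also be compared bit by bit. The sketch below uses XOR to find which bits differ and a shift-and-mask helper to read individual bits. The helper name get_bits is hypothetical, and the code makes no assumption about what each bit means in the quality band.

```python
a = 2720
b = 2976

# XOR leaves a 1 only where the two values differ
diff = a ^ b
print(f"{diff:016b}")

# Positions (counted from bit 0 on the right) of the differing bits
differing_bits = [i for i in range(16) if (diff >> i) & 1]
print(differing_bits)


def get_bits(value, start, width):
    """Hypothetical helper: return `width` bits of `value` starting at bit `start`."""
    return (value >> start) & ((1 << width) - 1)


# Read the differing bit from each value
print(get_bits(a, 8, 1), get_bits(b, 8, 1))
```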
answ = ?  # Choose one of "None", "Low", "Medium", "High"
check_answer("4.5", answ)
arr = np.array([1,2,3,4,5,np.nan,7,8,9], dtype=np.float32)
arr
print(np.mean(arr))
print(np.nanmean(arr))
So far we have filtered out water and cloud effects from images by setting the pixels to 0. However, if we want to compute statistics that summarise the information in the image, this can be problematic because the zeros are counted as valid values. For example, consider the following uint16 array in which the value 0 designates no data. To compute the mean of all the valid values, we can convert the array to a float type and then replace the 0 values with NaN.
arr = np.array([234,243,0,231,219,0,228,220,237], dtype=np.uint16)
print("0s mean:", np.mean(arr))
arr = arr.astype(np.float32)
arr[arr==0]=np.nan
print("NaNs mean:", np.nanmean(arr))
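The same conversion can also be written with np.where, and np.isnan can be used to count how many values remain valid. This is a sketch on the same numbers; the variable names raw, clean and n_valid are chosen for this example.

```python
import numpy as np

# Same values as in the cell above, with 0 standing for no data
raw = np.array([234, 243, 0, 231, 219, 0, 228, 220, 237], dtype=np.uint16)

# Integer arrays cannot hold NaN, so convert to float before substituting
clean = np.where(raw == 0, np.nan, raw.astype(np.float32))

# Count the valid (non-NaN) entries
n_valid = int(np.count_nonzero(~np.isnan(clean)))
print("valid values:", n_valid)

# NaN-aware reductions skip the missing entries
print("nanmean:", np.nanmean(clean))
print("nanmin:", np.nanmin(clean), "nanmax:", np.nanmax(clean))
```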
# 1.- Load the image
im = imageio.imread('data/land_mask.png')
# 2.- Select green channel
im = ?
# 3.- Change the type of im to float32
im = ?
# 4.- Use the previous boolean array to set all pixels other than grass to NaN
im?
# You should see all the NaN regions in white
plt.imshow(im)
# 5.- Calculate the mean value
answ = ?
check_answer("4.6", int(answ))