1 Introduction

In this module, we will discuss the following concepts:

  1. The difference between supervised and unsupervised image classification.
  2. The definitions and applications of the various classification algorithms available through Google Earth Engine.
  3. How to set up and run a classification using randomForest, with aspen presence and absence as an example dataset.

2 Background

Image Classification
Humans naturally tend to organize spatial information into groups. From above, we recognize common landforms like lakes and rivers, buildings and roads, forests and deserts. We call this grouping of objects by similar traits, “Image Classification”. But manually classifying objects and assigning values across the globe would be an unending task. Thankfully, the use of remotely sensed data to delineate varying landscape features into categorical classes has become a staple of ecological research over the past 40 years. Classifications have been performed for everything from agricultural development and land cover change, to silvicultural practices and pollution monitoring.

Unsupervised vs. Supervised
Image classification methods can be divided into two categories. First, unsupervised classification involves applying potential predictor variables to a geographic region and asking the predictive algorithm or a priori regression coefficients to do the work of image classification. The second, supervised classification, requires the creation of independent training data: information that a probabilistic model can use to find associations between observed conditions and a suite of predictor variables.

Google Earth Engine Classifiers
Of the available options in Google Earth Engine’s ee.Classifier() function, several fall under the general category of “machine learning”. The algorithmic functions “learn” from the data fed to them and make predictions based on that learned information. These classifiers are particularly adept at building statistical models from relationships between large numbers of remotely sensed predictors and (often highly non-linear) training data. The models can then be applied across large spatial extents to generate predictions in the form of map outputs. In recent years, classifiers such as classification and regression trees (CART) and randomForest have been imported from the computer science and statistics communities and into ecological research.

randomForest
One commonly used algorithm available in Google Earth Engine for supervised classification is randomForest (Breiman, 2001). In a nutshell, a randomForest (RF) model is constructed by taking randomized subsets of training data (e.g. field measurements, weather station recordings) and fitting them to random subsets of predictors (e.g. remotely sensed data) in decision trees. While no single tree perfectly captures the statistical relationships between training and predictor data, the compilation of trees, the forest, tells a more complete story. If that sounds complicated, do not worry! The classifier does all that hard work behind the scenes. But when using powerful tools like RF, it is our responsibility to know what kind of information the model is receiving: if you feed RF low-quality training data and predictors, your outputs will reflect that quality. So, know thy data!
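The voting idea can be sketched in a few lines of plain JavaScript. This is a toy illustration, not the Earth Engine implementation: the three "trees" and their thresholds are invented for the example.

```javascript
// Each "tree" is a simple rule fit to a random subset of the data;
// the forest's prediction is the majority vote across all trees.
function forestPredict(trees, pixel) {
  var votes = trees.map(function (tree) { return tree(pixel); });
  var presenceVotes = votes.filter(function (v) { return v === 1; }).length;
  return presenceVotes > votes.length / 2 ? 1 : 0;
}

// Three hypothetical "trees", each keying on a different predictor.
var trees = [
  function (p) { return p.ndvi > 0.4 ? 1 : 0; },
  function (p) { return p.nir > 3000 ? 1 : 0; },
  function (p) { return p.swir1 < 1500 ? 1 : 0; }
];

console.log(forestPredict(trees, {ndvi: 0.6, nir: 3500, swir1: 1200})); // 1 (presence)
console.log(forestPredict(trees, {ndvi: 0.1, nir: 1000, swir1: 2500})); // 0 (absence)
```

No single rule above is reliable on its own, but the vote across them is more robust, which is the intuition behind growing many trees.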

3 Google Earth Engine Image Classification Workflow

In this module, we are going to walk through a sample modeling workflow in Google Earth Engine. We will see how to set up our initial dataset, build a list of potential predictors, run our RF classifier, apply the resulting model to a larger spatial extent, and assess the accuracy of our RF model. For this example, we are going to use RF to classify aspen stands, which provide multiple ecosystem services from wood product stock to highly biodiverse wildlife habitat in western Colorado, USA.

3.1 Setting Up Our Imagery

At this point, our previous modules have covered many of the concepts necessary to load the data (Module 1), filter the data (Module 2), and create a basic cloud mask (Module 5). One small difference in the code below is that the cloud masking requires a tweak to be specific to Landsat 8. Running the code below limits our base dataset to fall imagery, capturing the unique yellow foliage of aspen.

Start a new script with the code below and your map pane should produce the image below.

// Import and filter Landsat 8 surface reflectance data.
var LS8_SR1 = ee.ImageCollection('LANDSAT/LC08/C01/T1_SR')
  .filterDate('2015-08-01', '2015-11-01') //new date
  .filter(ee.Filter.eq('WRS_PATH', 35))
  .filter(ee.Filter.eq('WRS_ROW', 33))
  .filterMetadata('CLOUD_COVER', 'less_than', 20);

// Create true color visualization parameters 
// to take an initial look at the study area.
var visTrueColor = {bands: ["B4","B3","B2"], max:2742, min:0};
Map.addLayer(LS8_SR1, visTrueColor, 'LS8_SR1', false);
Map.centerObject(ee.Geometry.Point(-107.8583, 38.8893), 9);

// Define a cloud mask function specific to Landsat 8.
// Bit 1 of the Collection 1 pixel_qa band flags clear pixels, so
// bitwiseAnd(2) is nonzero wherever the pixel is clear.
var maskClouds = function(image){
  var clear = image.select('pixel_qa').bitwiseAnd(2).neq(0);
  return image.updateMask(clear);
};

// Apply the cloud mask function to the previously filtered image 
// collection and calculate the median.
var LS8_SR2 = LS8_SR1
  .map(maskClouds)
  .median();
Map.addLayer(LS8_SR2, visTrueColor, 'LS8_SR2 - masked');

Visualizing our initial median Landsat image of western Colorado, USA with clouds removed.
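The bitwise test inside maskClouds can be checked in plain JavaScript. This is a sketch of the logic only, not the ee API; the example pixel_qa values were chosen to have bit 1 set or unset.

```javascript
// Bit 1 of pixel_qa marks clear pixels: value & 2 is nonzero when that
// bit is set, so only clear pixels survive the mask.
function isClear(pixelQa) {
  return (pixelQa & 2) !== 0;
}

console.log(isClear(322)); // bit 1 set -> true (pixel kept)
console.log(isClear(480)); // bit 1 unset -> false (pixel masked)
```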

3.2 Making a Predictor List

Now we can start building our list of predictors. You may read about classification algorithms that can handle predictor lists with “high-dimensionality”. This simply means a large number of potential explanatory variables can be included. Based on existing knowledge of the ecosystem we are studying, we can select an initial set of variables that we hypothesize could explain and predict aspen presence on the landscape. Appending the code below to your script will construct a multiband image containing all our desired predictors, including some spectral indices related to vegetation. Printing the predictors object should result in a console output shown below the code.

// First define individual bands as variables.
var red = LS8_SR2.select('B4').rename('red');
var green = LS8_SR2.select('B3').rename('green');
var blue = LS8_SR2.select('B2').rename('blue');
var nir = LS8_SR2.select('B5').rename('nir');
var swir1 = LS8_SR2.select('B6').rename('swir1');
var swir2 = LS8_SR2.select('B7').rename('swir2');

// Then, calculate three different vegetation indices: NDVI, NDWI, and TCB.
var ndvi = nir.subtract(red).divide(nir.add(red)).rename('ndvi');
var ndwi = green.subtract(nir).divide(green.add(nir)).rename('ndwi');
var TCB = LS8_SR2.expression(
  "0.3029 * B2 + 0.2786 * B3 + 0.4733 * B4 + 0.5599 * B5 + 0.508 * B6 + 0.1872 * B7", {
    'B2': blue,
    'B3': green,
    'B4': red,
    'B5': nir,
    'B6': swir1,
    'B7': swir2
  }).rename("TCB");

// Combine the predictors into a single image.
var predictors = nir
  .addBands(blue)
  .addBands(green)
  .addBands(red)
  .addBands(swir1)
  .addBands(swir2)
  .addBands(ndvi)
  .addBands(TCB)
  .addBands(ndwi);

print('predictors: ', predictors);


Text from the print statement of the predictors object, listing each band with the corresponding predictor name.
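The band math behind these indices is easy to check by hand. Below is a plain-JavaScript version for a single pixel; the reflectance values are invented for illustration, and in the script above Earth Engine applies the same arithmetic to every pixel in the image.

```javascript
// Normalized difference indices: values range from -1 to 1.
function ndvi(nir, red) {
  return (nir - red) / (nir + red);
}
function ndwi(green, nir) {
  return (green - nir) / (green + nir);
}
// Tasseled Cap Brightness: a weighted sum of the six reflective bands.
function tcb(blue, green, red, nir, swir1, swir2) {
  return 0.3029 * blue + 0.2786 * green + 0.4733 * red +
         0.5599 * nir + 0.508 * swir1 + 0.1872 * swir2;
}

// A hypothetical vegetated pixel (surface reflectance scaled 0-10000).
var px = {blue: 300, green: 500, red: 400, nir: 3000, swir1: 1500, swir2: 800};
console.log(ndvi(px.nir, px.red).toFixed(3)); // "0.765" - high NDVI, likely vegetation
```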

3.3 Loading Training Data

To extract predictor values at points, we must first import our point dataset. Luckily, we have one ready to go that can be called directly with a specific FeatureCollection ID. The points indicate areas of aspen stand presence and absence, so we have named the variable PA. Imported training data can be much more complicated, but for our purposes a simple binary classification will do the trick. Once we load our training data, we will need to extract the values of our predictors at each point.

Adding the code below to our existing script, we can see that our training data have been loaded. Feel free to tweak the colors to your personal preference but the result should appear similar to the one shown below.

var PA = ee.FeatureCollection('users/GDPE-GEE/Module7_PresAbs');
Map.addLayer(PA.style({color: 'red', pointSize: 3, width: 1, fillColor: 'white'}),{}, 'Merged_Presence_Absence');

var samples = predictors.sampleRegions({
  collection: PA,
  properties: ['presence'],
  scale: 30 });

Point data containing information on the presence and absence locations of aspen stands.
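Conceptually, sampleRegions pairs each point with the predictor values beneath it and carries the point's 'presence' label along, producing one training row per point. A plain-JavaScript sketch of that behavior (the lookup function and its values are hypothetical):

```javascript
// For each point, read the predictor values at that location and attach
// the point's 'presence' label, producing one training row per point.
function sampleAtPoints(predictorAt, points) {
  return points.map(function (pt) {
    var row = predictorAt(pt.x, pt.y);
    row.presence = pt.presence;
    return row;
  });
}

// Hypothetical lookup standing in for the predictor image.
function predictorAt(x, y) {
  return {ndvi: 0.5, nir: 2800};
}

var rows = sampleAtPoints(predictorAt, [{x: 0, y: 0, presence: 1}]);
console.log(rows[0]); // { ndvi: 0.5, nir: 2800, presence: 1 }
```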

3.4 Building the Model

Our training data now consist of the reflectance values (from our spatial data variables) as recorded at each point location. This is what our RF model will use to learn where aspen does and does not occur. As we proceed, it is important to understand that classifier algorithms in Google Earth Engine should be considered initial explorations of the potential for remote sensing to enhance your work. Why is this? Let us look at one of the parameters you can adjust in the RF classifier: numberOfTrees. Here, we have kept this number very low so that your model results load relatively quickly. Increasing this number from 10 to, say, 1000 will cause Google Earth Engine to take a very long time to process. The problem with limiting the numberOfTrees parameter is that research has shown that larger numbers of trees generate more statistically robust RF models (e.g. Evans and Cushman, 2009). As the famous saying by George Box goes, “all models are wrong, but some are useful”, and it is good to know the caveats of using a particular system and algorithm.

// Using the sampled data, build a randomForest model.
// Using a specific seed (random number) exactly replicates your model each time you run it.
var trainingclassifier = ee.Classifier.smileRandomForest({
    numberOfTrees: 10,
    seed: 7})
  .train({
    features: samples,
    classProperty: 'presence'});

print(trainingclassifier);

The results from the console tab after printing the trainingclassifier object. Note that we can verify which model options we have chosen because the values are printed for numberOfTrees and seed.

3.5 Accuracy Assessment

After acknowledging the caveats of parameter limits in Google Earth Engine, it is still a good idea to know how much we can trust the result of our model before we use it to make any predictions. One way to assess the accuracy of the classifier is to look at the confusion matrix. Just remember that this is only measuring the accuracy of our training data!

Append the code below to your script and re-run to produce the Console output as shown below. It is not the prettiest of visualizations but it gets the point across that this appears to be a highly accurate model of aspen presence and absence.

// Print Confusion Matrix and Overall Accuracy.
var confusionMatrix = trainingclassifier.confusionMatrix();
print('Error matrix: ', confusionMatrix);
print('Training overall accuracy: ', confusionMatrix.accuracy());

The Console tab results from our randomForest model, including the confusion matrix and overall accuracy (left) with an explanatory diagram for the confusion matrix results (right).
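Overall accuracy is simply the proportion of correctly classified training points: the sum of the confusion matrix diagonal divided by the sum of all its entries. A plain-JavaScript sketch of that calculation, using hypothetical counts:

```javascript
// Rows are the actual class (0 = absence, 1 = presence), columns the
// predicted class; accuracy = correct (diagonal) / total predictions.
function overallAccuracy(matrix) {
  var total = 0, correct = 0;
  for (var i = 0; i < matrix.length; i++) {
    for (var j = 0; j < matrix[i].length; j++) {
      total += matrix[i][j];
      if (i === j) correct += matrix[i][j];
    }
  }
  return correct / total;
}

// Hypothetical counts: 95 true absences, 90 true presences, 15 errors.
console.log(overallAccuracy([[95, 5], [10, 90]])); // 0.925
```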

3.6 Applying the Model

There are no hard and fast rules for acceptable model accuracy. It will depend on your dataset, study area, and expectations set in the literature. That being said, our model was produced with very high accuracy, so we can feel reasonably comfortable using it to make predictions across the landscape. Regardless of model accuracy, ecological knowledge can also help guide the interpretation of model results: predictions of aspen should not occur above treeline, nor should alpine vegetation appear in canyon bottoms. Visual inspection of model outputs is always recommended, whether you are an expert in your field or just using common sense. Append the final piece of code to complete the modeling walkthrough and view your predicted results.

// Apply the model to the extent of the loaded predictor image.
var classified = predictors.classify(trainingclassifier);
Map.addLayer(classified, {min: 0, max: 1, palette: ['white', 'blue']}, 'classified');


The results of using our randomForest model to make predictions across the landscape. Aspen presence is indicated in blue and absence in white.

4 Conclusion

In this module, we have provided an introduction to image classification in Google Earth Engine. We discussed some of the basic definitions and general features of classification methods, including a machine learning algorithm called randomForest. We then employed randomForest to help us generate a landscape-scale prediction of aspen presence and absence in western Colorado, USA, by combining information from remotely sensed predictors and field data.

5 Complete Code

// Import and filter Landsat 8 surface reflectance data.
var LS8_SR1 = ee.ImageCollection('LANDSAT/LC08/C01/T1_SR')
  .filterDate('2015-08-01', '2015-11-01') //new date
  .filter(ee.Filter.eq('WRS_PATH', 35))
  .filter(ee.Filter.eq('WRS_ROW', 33))
  .filterMetadata('CLOUD_COVER', 'less_than', 20);

// Create true color visualization parameters 
// to take an initial look at the study area.
var visTrueColor = {bands: ["B4","B3","B2"], max:2742, min:0};
Map.addLayer(LS8_SR1, visTrueColor, 'LS8_SR1', false);
Map.centerObject(ee.Geometry.Point(-107.8583, 38.8893), 9);

// Define a cloud mask function specific to Landsat 8.
// Bit 1 of the Collection 1 pixel_qa band flags clear pixels, so
// bitwiseAnd(2) is nonzero wherever the pixel is clear.
var maskClouds = function(image){
  var clear = image.select('pixel_qa').bitwiseAnd(2).neq(0);
  return image.updateMask(clear);
};

// Apply the cloud mask function to the previously filtered image 
// collection and calculate the median.
var LS8_SR2 = LS8_SR1
  .map(maskClouds)
  .median();
Map.addLayer(LS8_SR2, visTrueColor, 'LS8_SR2 - masked');

// First define individual bands as variables.
var red = LS8_SR2.select('B4').rename('red');
var green = LS8_SR2.select('B3').rename('green');
var blue = LS8_SR2.select('B2').rename('blue');
var nir = LS8_SR2.select('B5').rename('nir');
var swir1 = LS8_SR2.select('B6').rename('swir1');
var swir2 = LS8_SR2.select('B7').rename('swir2');

// Then, calculate three different vegetation indices: NDVI, NDWI, and TCB.
var ndvi = nir.subtract(red).divide(nir.add(red)).rename('ndvi');
var ndwi = green.subtract(nir).divide(green.add(nir)).rename('ndwi');
var TCB = LS8_SR2.expression(
  "0.3029 * B2 + 0.2786 * B3 + 0.4733 * B4 + 0.5599 * B5 + 0.508 * B6 + 0.1872 * B7", {
    'B2': blue,
    'B3': green,
    'B4': red,
    'B5': nir,
    'B6': swir1,
    'B7': swir2
  }).rename("TCB");

// Combine the predictors into a single image.
var predictors = nir
  .addBands(blue)
  .addBands(green)
  .addBands(red)
  .addBands(swir1)
  .addBands(swir2)
  .addBands(ndvi)
  .addBands(TCB)
  .addBands(ndwi);

print('predictors: ', predictors);

// Load the field sampling locations.
var PA = ee.FeatureCollection('users/GDPE-GEE/Module7_PresAbs');
Map.addLayer(PA.style({color: 'red', pointSize: 3, width: 1, fillColor: 'white'}),{}, 'Merged_Presence_Absence');

// Determine the values of each predictor at each training data location.
var samples = predictors.sampleRegions({
  collection: PA,
  properties: ['presence'],
  scale: 30 });

// Using the sampled data, build a randomForest model.
// Using a specific seed (random number) exactly replicates your model each time you run it.
var trainingclassifier = ee.Classifier.smileRandomForest({
    numberOfTrees: 10,
    seed: 7})
  .train({
    features: samples,
    classProperty: 'presence'});

print(trainingclassifier);

// Print Confusion Matrix and Overall Accuracy.
var confusionMatrix = trainingclassifier.confusionMatrix();
print('Error matrix: ', confusionMatrix);
print('Training overall accuracy: ', confusionMatrix.accuracy());

// Apply the model to the extent of the loaded predictor image.
var classified = predictors.classify(trainingclassifier);
Map.addLayer(classified, {min: 0, max: 1, palette: ['white', 'blue']}, 'classified');

6 References

Breiman, Leo. “Random Forests.” Machine Learning 45, no. 1 (2001): 5-32. https://doi.org/10.1023/A:1010933404324

Evans, Jeffrey S., and Samuel A. Cushman. “Gradient Modeling of Conifer Species Using Random Forests.” Landscape Ecology 24, no. 5 (May 2009): 673-83. https://doi.org/10.1007/s10980-009-9341-0