NAME

r.learn.ml - Supervised classification and regression of GRASS rasters using the python scikit-learn package

KEYWORDS

raster, classification, regression, machine learning, scikit-learn

SYNOPSIS

r.learn.ml
r.learn.ml --help
r.learn.ml [-tnfrsipzmbl] group=name [trainingmap=name] [trainingpoints=name] [field=name] [output=name] [classifier=string] [c=float[,float,...]] [max_features=integer[,integer,...]] [max_depth=integer[,integer,...]] [min_samples_split=integer[,integer,...]] [min_samples_leaf=integer[,integer,...]] [n_estimators=integer[,integer,...]] [learning_rate=float[,float,...]] [subsample=float[,float,...]] [max_degree=integer[,integer,...]] [n_neighbors=integer[,integer,...]] [weights=string[,string,...]] [grid_search=string] [categorymaps=string[,string,...]] [cvtype=string] [n_partitions=integer] [group_raster=name] [cv=integer] [n_permutations=integer] [errors_file=name] [preds_file=name] [fimp_file=name] [param_file=name] [random_state=integer] [lines=integer] [indexes=integer[,integer,...]] [n_jobs=integer] [save_training=name] [load_training=name] [save_model=name] [load_model=name] [--overwrite] [--help] [--verbose] [--quiet] [--ui]

Flags:

-t
Perform hyperparameter tuning only
-n
Use nested cross validation
Use nested cross validation as part of hyperparameter tuning
-f
Estimate permutation-based feature importances
Estimate feature importance using a permutation-based method
-r
Make predictions for cross validation resamples
Produce raster predictions for all cross validation resamples
-s
Standardization preprocessing
-i
Impute training data preprocessing
-p
Output class membership probabilities
-z
Only predict class probabilities
-m
Build model only - do not perform prediction
-b
Balance training data using class weights
-l
Use memory swap
--overwrite
Allow output files to overwrite existing files
--help
Print usage summary
--verbose
Verbose module output
--quiet
Quiet module output
--ui
Force launching GUI dialog

Parameters:

group=name [required]
Imagery group to be classified
GRASS imagery group of raster maps to be used in the machine learning model
trainingmap=name
Labelled pixels
Raster map with labelled pixels for training
trainingpoints=name
Training point vector
Vector points map for training
field=name
Response attribute column
Name of attribute column in trainingpoints containing response value
output=name
Output Map
Prediction surface result from classification or regression model
classifier=string
Classifier
Supervised learning model to use
Options: LogisticRegression, LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis, KNeighborsClassifier, GaussianNB, DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, GradientBoostingRegressor, SVC, EarthClassifier, EarthRegressor, XGBClassifier, XGBRegressor
Default: RandomForestClassifier
c=float[,float,...]
Inverse of regularization strength
Inverse of regularization strength (LogisticRegression and SVC)
Default: 1.0
max_features=integer[,integer,...]
Number of features available during node splitting
Number of features available during node splitting (tree-based classifiers and regressors)
Default: 0
max_depth=integer[,integer,...]
Maximum tree depth; zero uses classifier defaults
Maximum tree depth for tree-based method; zero uses classifier defaults (full-growing for Decision trees and Randomforest, 3 for GBM and XGB)
Default: 0
min_samples_split=integer[,integer,...]
The minimum number of samples required for node splitting
The minimum number of samples required for node splitting in tree-based classifiers
Default: 2
min_samples_leaf=integer[,integer,...]
The minimum number of samples required to form a leaf node
The minimum number of samples required to form a leaf node in tree-based classifiers
Default: 1
n_estimators=integer[,integer,...]
Number of estimators
Number of estimators (trees) in ensemble tree-based classifiers
Default: 100
learning_rate=float[,float,...]
learning rate
learning rate (also known as shrinkage) for gradient boosting methods
Default: 0.1
subsample=float[,float,...]
The fraction of samples to be used for fitting
The fraction of samples to be used for fitting, controls stochastic behaviour of gradient boosting methods
Default: 1.0
max_degree=integer[,integer,...]
The maximum degree of terms in forward pass
The maximum degree of terms in forward pass for Py-earth
Default: 1
n_neighbors=integer[,integer,...]
Number of neighbors to use
Default: 5
weights=string[,string,...]
weight function
weight function for knn prediction
Options: uniform, distance
Default: uniform
grid_search=string
Resampling method to use for hyperparameter optimization
Options: cross-validation, holdout
Default: cross-validation
categorymaps=string[,string,...]
Indices of categorical rasters within the imagery group (0..n)
Indices of categorical rasters within the imagery group (0..n) that will be one-hot encoded
cvtype=string
Non-spatial or spatial cross-validation
Perform non-spatial, clumped or clustered k-fold cross-validation
Options: non-spatial, clumped, kmeans
Default: non-spatial
n_partitions=integer
Number of kmeans spatial partitions
Number of kmeans spatial partitions for kmeans clustered cross-validation
Default: 10
group_raster=name
Custom group ids for training samples from GRASS raster
GRASS raster containing group ids for training samples. Samples with the same group id will not be split between training and test cross-validation folds
cv=integer
Number of cross-validation folds
Default: 1
n_permutations=integer
Number of permutations to perform for feature importances
Default: 10
errors_file=name
Save cross-validation global accuracy results to csv
Name for output file
preds_file=name
Save cross-validation predictions to csv
Name for output file
fimp_file=name
Save feature importances to csv
Name for output file
param_file=name
Save hyperparameter search scores to csv
Name for output file
random_state=integer
Seed to use for random state
Default: 1
lines=integer
Processing block size in terms of number of rows
Default: 25
indexes=integer[,integer,...]
Indexes of class probabilities to predict. Default -1 predicts all classes
Default: -1
n_jobs=integer
Number of cores for multiprocessing, -2 is n_cores-1
Default: -2
save_training=name
Save training data to csv
Name for output file
load_training=name
Load training data from csv
Name of input file
save_model=name
Save model to file
Name for output file
load_model=name
Load model from file
Name of input file

DESCRIPTION

r.learn.ml is a front-end to the scikit-learn Python package for performing classification and regression on GRASS rasters that belong to an imagery group. The module provides several classifiers that are commonly used in remote sensing and spatial modelling. The classifier is chosen with the classifier parameter; the available classification and regression methods are those listed under its options above. For more details on the classifiers, refer to the scikit-learn documentation.

The Classifier parameters tab provides access to the most pertinent parameters that affect these algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be tuned using a grid search by entering multiple parameter settings as a comma-separated list. This tuning can also be combined with nested cross-validation by also setting the cv option to > 1. The individual parameters are described in the Parameters section above.
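As an illustration of how comma-separated settings translate into a search grid, the following is a minimal sketch in plain Python. It is not the module's own code: the helper name expand_grid is hypothetical, and the values are kept as strings for simplicity.

```python
from itertools import product

def expand_grid(options):
    """Expand comma-separated option strings into a list of parameter dicts."""
    names = sorted(options)
    # Split each option value on commas; values stay as strings in this sketch
    value_lists = [options[name].split(",") for name in names]
    # Every combination of the supplied values becomes one candidate setting
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# Two values for n_estimators and three for max_features give 2 x 3 candidates
grid = expand_grid({"n_estimators": "100,500", "max_features": "2,4,6"})
print(len(grid))
```

The number of models to fit grows multiplicatively with each parameter that is given more than one value, so long lists on several parameters can make tuning slow.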

In addition to model fitting and prediction, feature importances can be estimated using the -f flag. The method employed is based on Brenning (2012) and consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and permuted data; (3) repeating step 2 n_permutations times; (4) averaging the results. Steps 1-4 are repeated on each of the k partitions. The feature importances represent the average decrease in performance when each variable is permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R².
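Steps 1-4 can be sketched for a single test partition as follows. This is a simplified illustration, not the module's implementation: the function names and the toy predict model are hypothetical, and accuracy is used as the metric for brevity even though the example response is binary.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_permutations=10, seed=1):
    rng = random.Random(seed)
    # Step 1: performance on the unmodified test partition
    baseline = accuracy(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_permutations):  # Step 3: repeat the permutation
            # Step 2: shuffle one column and measure the drop in performance
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = accuracy(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        # Step 4: average the drops over all permutations
        importances.append(sum(drops) / n_permutations)
    return importances

# Hypothetical stand-in for a fitted model: uses feature 0 and ignores feature 1
def predict(row):
    return int(row[0] > 0.5)

X = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
y = [0, 1, 0, 1, 0, 1]
imp = permutation_importance(predict, X, y)
print(imp)
```

Because the toy model ignores the second feature, shuffling it cannot change the predictions and its importance is exactly zero, whereas shuffling the first feature lowers the accuracy.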

Cross-validation can be performed by setting the cv parameter to > 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced depending on whether the response variable is binary or multiclass, or the classifier is for regression or classification. The cvtype parameter can also be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is intended for training pixels that represent polygons; cross-validation is then effectively performed on a polygon basis. Kmeans spatial cross-validation partitions the training pixels into n_partitions groups by k-means clustering of the pixel coordinates. These partitions are then used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient, a raster containing the group ids of the partitions can be supplied using the group_raster option.
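The key property of the grouped schemes is that samples sharing a group id are never split between the training and test folds. A minimal sketch of that idea is shown below; the function group_folds and its round-robin assignment of groups to folds are assumptions made for illustration, not the module's actual partitioning code.

```python
def group_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which no group id is split."""
    unique = sorted(set(groups))
    for k in range(n_folds):
        # Assign whole groups to folds in round-robin order
        test_groups = set(unique[k::n_folds])
        test = [i for i, g in enumerate(groups) if g in test_groups]
        train = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train, test

# Group ids per training sample, e.g. polygon ids or k-means cluster labels
groups = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
folds = list(group_folds(groups, n_folds=3))
for train, test in folds:
    print("train:", train, "test:", test)
```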

Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The -s flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Additionally, most of the classifiers do not perform well if there is a large class imbalance in the training data. Using the -b flag balances the training data by weighting of the minority classes relative to the majority class. This does not apply to the Naive Bayes or LinearDiscriminantAnalysis classifiers.
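The two preprocessing ideas can be sketched in plain Python. This is an illustration only: the function names are hypothetical, and the weighting formula n_samples / (n_classes * class_count) is scikit-learn's "balanced" heuristic, which is assumed here rather than confirmed as what the -b flag uses.

```python
from collections import Counter
from statistics import mean, pstdev

def standardize(column):
    """Rescale a predictor to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    # Note: a constant column (sigma == 0) would need special handling
    return [(v - mu) / sigma for v in column]

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the training data."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

elevation = [100.0, 200.0, 300.0]
z = standardize(elevation)

# Class 1 is the minority class and therefore receives the larger weight
weights = balanced_class_weights([0, 0, 0, 1])
print(z, weights)
```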

Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (e.g. decision trees), but generally categorical predictors need to be converted to a set of binary variables using one-hot encoding (i.e. each value in a categorical raster is parsed into a separate binary grid). Entering the comma-separated indices of the categorical rasters, numbered 0...n in the order they are listed in the imagery group, in the categorymaps option will cause one-hot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the one-hot-encoded variables. Note: it is important that the training samples include all of the categories in the rasters, otherwise the one-hot encoding will fail when it comes to the prediction.
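The encoding, and the reason unseen categories cause a failure at prediction time, can be sketched as follows. The helper names and the sample values are hypothetical; this is not the encoder used by the module.

```python
def fit_categories(values):
    """Record the categories that occur in the training samples."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as a binary vector with one column per category."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        if v not in index:
            # A category absent from the training data has no column to map to
            raise ValueError(f"category {v!r} was not present in the training data")
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return encoded

train_landuse = [1, 3, 3, 7]
categories = fit_categories(train_landuse)
encoded = one_hot(train_landuse, categories)
print(encoded)
```

Calling one_hot([5], categories) raises a ValueError in this sketch, because category 5 never appeared in the training samples.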

The module also offers the ability to save and load a classification or regression model. Saving and loading a model allows a model to be fitted on one imagery group, with the prediction applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.

For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the save_training option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.

NOTES

r.learn.ml uses the "scikit-learn" machine learning Python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the distribution's package manager. For MS-Windows users using a 64-bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from Christoph Gohlke and by using the OSGeo4W installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.

r.learn.ml is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from disk row by row using the RasterRow method in PyGRASS. A single row, however, is not an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the lines parameter are passed to the classifier, and the classified image is reconstructed and written row by row back to disk. lines=25 should be reasonable for most systems with 4-8 GB of RAM. Row-by-row access, however, is slow when sampling the imagery group to build the training data set from a raster supplied as the trainingmap. For this step, the default behaviour is instead to read the predictors into memory one at a time. If this still exceeds the system memory, the -l flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.
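The block-wise prediction described above amounts to splitting the raster rows into chunks of the size given by lines. A minimal sketch of that chunking is shown below; the generator row_blocks is a hypothetical helper and does not use PyGRASS.

```python
def row_blocks(n_rows, lines=25):
    """Yield (start, stop) row ranges covering the raster in blocks."""
    for start in range(0, n_rows, lines):
        # The final block is shorter when n_rows is not a multiple of lines
        yield start, min(start + lines, n_rows)

# A raster with 60 rows processed 25 rows at a time
blocks = list(row_blocks(n_rows=60, lines=25))
print(blocks)
```

A larger lines value passes more data to the classifier per call at the cost of holding more rows in memory at once.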

Many of the classifiers involve a random process, which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the random_state parameter.

EXAMPLE

Here we use the GRASS GIS sample North Carolina data set to perform a Landsat classification. We classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.

Landsat 7 (2000) bands 7,4,2 color composite example:

Note that this example must be run in the "landsat" mapset of the North Carolina sample data set location.

First, we are going to generate some training pixels from an older (1996) land cover classification:

g.region raster=landclass96 -p
r.random input=landclass96 npoints=1000 raster=landclass96_roi

Then we can use these training pixels to perform a classification on the more recently obtained Landsat 7 image:

r.learn.ml group=lsat7_2000 trainingmap=landclass96_roi output=rf_classification \
  classifier=RandomForestClassifier n_estimators=500 random_state=1 lines=25

# copy category labels from landclass training map to result
r.category rf_classification raster=landclass96_roi

# copy color scheme from landclass training map to result
r.colors rf_classification raster=landclass96_roi

# print the category labels of the result
r.category rf_classification

Random forest classification result:

ACKNOWLEDGEMENTS

Thanks to Paulo van Breugel and Vaclav Petras for general testing, and to Paulo for the suggestion to enable saving of the fitted models.

REFERENCES

Brenning, A. 2012. Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: the R package 'sperrorest'. 2012 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 23-27 July 2012, p. 5372-5375.

Pedregosa, F. et al. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, p. 2825-2830.

AUTHOR

Steven Pawley

Last changed: $Date: 2017-05-03 22:52:05 +0200 (Wed, 03 May 2017) $

SOURCE CODE

Available at: r.learn.ml source code (history)


© 2003-2017 GRASS Development Team, GRASS GIS 7.2.3svn Reference Manual