GRASS GIS manual: v.class.ml

NAME

v.class.ml - Classification of a vector map based on the values in its attribute tables

KEYWORDS

vector, classification, machine learning

SYNOPSIS

v.class.ml

v.class.ml --help

v.class.ml [-enfbocrtvxda] vector=name [vtraining=name] [vlayer=string] [tlayer=string] [rlayer=string] [npy_data=string] [npy_cats=string] [npy_cols=string] [npy_index=string] [npy_tdata=string] [npy_tclasses=string] [npy_btdata=string] [npy_btclasses=string] [imp_csv=string] [imp_fig=string] [scalar=string[,string,...]] [decomposition=string] [n_training=integer] [pyclassifiers=string] [pyvar=string] [pyindx=string] [pyindx_optimize=string] [nan=string[,string,...]] [inf=string[,string,...]] [neginf=string[,string,...]] [posinf=string[,string,...]] [csv_test_cls=string] [report_class=string] [svc_c_range=float[,float,...]] [svc_gamma_range=float[,float,...]] [svc_kernel_range=string[,string,...]] [svc_poly_range=string[,string,...]] [svc_n_jobs=integer] [svc_c=float] [svc_gamma=float] [svc_kernel=string] [svc_img=string] [rst_names=string] [--overwrite] [--help] [--verbose] [--quiet] [--ui]

Flags:

-e: Extract the training set from the vtraining map
-n: Export to numpy files
-f: Feature importances using extra trees algorithm
-b: Balance the training set using the size of the class with the fewest samples
-o: Optimize the training samples
-c: Classify the whole dataset
-r: Export the classification results to raster maps
-t: Test different classification methods
-v: Add to the test: compute the bias-variance
-x: Add to the test: compute extra parameters such as confusion matrix, ROC, PR
-d: Explore the SVC domain
-a: Append the classification results
--overwrite: Allow output files to overwrite existing files
--help: Print usage summary
--verbose: Verbose module output
--quiet: Quiet module output
--ui: Force launching GUI dialog

Parameters:

vector=name [required]: Name of vector map; Name of input vector map
vtraining=name: Name of vector map; Name of training vector map
vlayer=string: layer name or number to use for data
tlayer=string: layer number/name for the training layer
rlayer=string: layer number/name for the ML results
npy_data=string: Data with statistics in npy format.; Default: data.npy
npy_cats=string: Numpy array with vector cats.; Default: cats.npy
npy_cols=string: Numpy array with columns names.; Default: cols.npy
npy_index=string: Boolean numpy array with training indexes.; Default: indx.npy
npy_tdata=string: Numpy file with the training data.; Default: training_data.npy
npy_tclasses=string: Numpy file with the training classes.; Default: training_classes.npy
npy_btdata=string: Numpy file with the balanced training data.; Default: Xbt.npy
npy_btclasses=string: Numpy file with the balanced training classes.; Default: Ybt.npy
imp_csv=string: CSV file name with the feature importances rank using extra tree algorithms; Default: features_importances.csv
imp_fig=string: Figure file name with feature importances rank using extra tree algorithms; Default: features_importances.png
scalar=string[,string,...]: Scaler methods to apply; with_mean centers the data before scaling. If no method is given, the data is not scaled at all; Default: with_mean,with_std
decomposition=string: choose a decomposition method (PCA, KernelPCA, ProbabilisticPCA, RandomizedPCA, FastICA, TruncatedSVD) and set the parameters using the | to separate the decomposition method from the parameters like: PCA|n_components=98; Default:
n_training=integer: Number of random training samples per class used to train the machine-learning algorithms
pyclassifiers=string: A Python file with classifiers
pyvar=string: Name of the Python variable, which must be a list of dictionaries
pyindx=string: Index or range of indexes of the classifiers that you want to use
pyindx_optimize=string: Index of the classifier used to optimize the training set
nan=string[,string,...]: Column pattern:Value or Numpy function to use to substitute NaN values; Default: *_skewness:nanmean,*_kurtosis:nanmean
inf=string[,string,...]: Key:Value or Numpy function to use to substitute Inf values; Default: *_skewness:nanmean,*_kurtosis:nanmean
neginf=string[,string,...]: Key:Value or Numpy function to use to substitute neginf values; Default:
posinf=string[,string,...]: Key:Value or Numpy function to use to substitute posinf values; Default:
csv_test_cls=string: csv file name with results of different machine learning scores; Default: test_classifiers.csv
report_class=string: text file name with the report of different machine learning algorithms; Default: classification_report.txt
svc_c_range=float[,float,...]: C value range list to explore SVC domain; Default: 1e-2,1e-1,1e0,1e1,1e2,1e3,1e4,1e5,1e6,1e7,1e8
svc_gamma_range=float[,float,...]: gamma value range list to explore SVC domain; Default: 1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0,1e1,1e2,1e3,1e4
svc_kernel_range=string[,string,...]: kernel value range list to explore SVC domain; Default: linear,poly,rbf,sigmoid
svc_poly_range=string[,string,...]: polynomial order list to explore SVC domain; Default:
svc_n_jobs=integer: number of jobs to use during the domain exploration; Default: 1
svc_c=float: definitive C value
svc_gamma=float: definitive gamma value
svc_kernel=string: definitive kernel value. Available kernels are: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’; Default: rbf
svc_img=string: filename pattern with the image of SVC parameter; Default: domain_%s.svg
rst_names=string: filename pattern for raster; Default: %s

DESCRIPTION

v.class.ml uses machine-learning algorithms to classify a vector map based on the values of its attribute table. The module relies on machine-learning libraries available for Python. At the moment it uses scikit-learn (the package name may be "python-scikit-learn") and MLPY, but it should be easy to add other Python libraries. The module is designed to be used in a modular way: the flags define which independent tasks are executed.

Flags:

-e: Extract the training set from a vector map (vtraining).
-n: Store the attribute table data, column names, categories, training data, and training index in numpy binary files.
-f: Rank feature importances using an ExtraTreesClassifier algorithm.
-b: Balance the training set using the size of the class with the fewest training samples, or the value set in n_training.
-o: Optimize a balanced training dataset using the size of the class with the fewest training samples, or the value set in n_training.
-c: Classify the whole dataset.
-r: Export machine-learning results to raster maps.
-t: Test several machine-learning algorithms on your dataset.
-v: Also test the bias-variance.
-x: Also compute extra parameters to evaluate the different algorithms, such as confusion matrix, ROC, PR.
-d: Explore the Support Vector Classification (SVC) domain.

Input parameters:

The vector parameter is the input vector map. The input vector map must be prepared with v.category to copy the categories to all the layers that will be created.

The vtraining parameter is a vector input map that can be used to select the training areas. Currently only supervised classification is implemented so this parameter is mandatory. The training vector map can be generated using the GRASS standard tool for supervised classification g.gui.iclass.

The vlayer parameter is the layer name or number of the attribute tables with the data that must be used as input for the machine-learning algorithms.

The tlayer parameter is the layer name or number of the attribute table where the training data for the machine-learning algorithms are, or will be, stored.

The rlayer parameter is the layer name or number of the attribute table where the machine-learning results will be stored.

The npy_data parameter is a string with the path to define where the binary numpy files containing the complete dataset will be saved.

The npy_cats parameter is a string with the path to define where the binary numpy files containing the vector categories will be saved.

The npy_cols parameter is a string with the path to define where the binary numpy files containing the column names of the data attribute table will be saved.

The npy_index parameter is a string with the path to define where the binary numpy file containing a boolean array, which marks whether each category is used for training, will be saved.

The npy_tdata parameter is a string with the path to define where the binary numpy files containing a training data array will be saved.

The npy_tclasses parameter is a string with the path to define where the binary numpy files containing the training classes will be saved.

The npy_btdata parameter is like npy_tdata, but for a balanced dataset.

The npy_btclasses parameter is like npy_tclasses, but for a balanced dataset.

The imp_csv parameter is a string with the path to define where a CSV file containing the rank of the feature importances should be saved.

The imp_fig parameter is a string with the path to define where a figure file containing the rank of the feature importances should be saved.

The scalar parameter is a string with the scaler methods that will be applied to pre-process the data. Two main methods are available: with_mean, with_std. Scaling is a very common pre-processing task, so the default applies both methods.
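The effect of the two scaler options can be sketched in plain Python. This is only an illustration, assuming that with_mean and with_std behave like the identically named options of scikit-learn's StandardScaler (subtract the column mean, divide by the population standard deviation). The function name scale_column is hypothetical and is not part of v.class.ml.

```python
from statistics import fmean, pstdev


def scale_column(values, with_mean=True, with_std=True):
    """Center and/or scale one column of numbers (hypothetical helper)."""
    # Assumption: with_mean subtracts the column mean.
    mean = fmean(values) if with_mean else 0.0
    # Assumption: with_std divides by the population standard deviation.
    std = pstdev(values) if with_std else 1.0
    if std == 0.0:
        # A constant column cannot be scaled; leave its spread untouched.
        std = 1.0
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0]
centered = scale_column(column, with_std=False)  # only with_mean
scaled = scale_column(column)                    # with_mean,with_std
print(centered)
print(scaled)
```

With both options active, each column ends up with zero mean and unit variance, which keeps columns with large numeric ranges from dominating distance-based classifiers.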

The decomposition parameter is a string with the decomposition method that will be applied to pre-process the data. The main decomposition methods available are: PCA, KernelPCA, ProbabilisticPCA, RandomizedPCA, FastICA, TruncatedSVD. Each of these methods can take several parameters. Use "|" as the separator between the decomposition method name and its options, and "," to separate the options. For example, to decompose using the KernelPCA method with 10 components and a linear kernel, the correct string is: "KernelPCA|n_components=10,kernel=linear"
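The structure of the decomposition string can be illustrated with a small parser. This is a sketch of the syntax described above, not the module's own parsing code; the function name parse_decomposition and the rule that converts numeric option values are assumptions.

```python
import ast


def parse_decomposition(spec):
    """Split 'Method|key=value,key=value' into (method, options)."""
    method, _, params = spec.partition("|")
    options = {}
    # filter(None, ...) drops the empty item produced when no options are given.
    for item in filter(None, params.split(",")):
        key, _, raw = item.partition("=")
        raw = raw.strip()
        try:
            # Assumption: numeric values such as 10 become Python numbers.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else, such as a kernel name, stays a string.
            value = raw
        options[key.strip()] = value
    return method.strip(), options


method, options = parse_decomposition("KernelPCA|n_components=10,kernel=linear")
print(method, options)
```

The resulting dictionary has the shape of keyword arguments that could be passed to the constructor of the named decomposition class.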

The n_training parameter is an integer with the number of training samples that must be used per class. Some machine-learning methods are sensitive to whether the training dataset is balanced or not. By default all the training samples are used.

The pyclassifiers parameter is a file path to a Python file containing a list of dictionaries that define the classifier classes and their options. See an example of the default classifiers used by the v.class.ml module.
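Balancing a training set can be sketched as drawing the same number of random samples from every class. The code below is a minimal illustration and not the module's implementation: the function name balance_training, the fixed random seed, and the choice to cap n_training at the size of the smallest class are all assumptions.

```python
import random
from collections import defaultdict


def balance_training(cats, classes, n_training=None, seed=0):
    """Return {class: sorted categories}, with equal counts per class."""
    by_class = defaultdict(list)
    for cat, cls in zip(cats, classes):
        by_class[cls].append(cat)
    # Size of the class with the fewest training samples.
    smallest = min(len(members) for members in by_class.values())
    # Assumption: n_training cannot exceed what the smallest class offers.
    size = smallest if n_training is None else min(n_training, smallest)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return {cls: sorted(rng.sample(members, size))
            for cls, members in by_class.items()}


cats = list(range(1, 11))              # vector categories 1..10
classes = ["a"] * 7 + ["b"] * 3        # class "b" is the smallest
balanced = balance_training(cats, classes)
capped = balance_training(cats, classes, n_training=2)
print(balanced)
print(capped)
```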

The pyvar parameter is a string with the python variable name defined in the pyclassifiers file.

The pyindx parameter is a string with the indexes of the classifiers that will be used. In the string you can define a range using the minus character, list the indexes using the comma as separator, or combine both. For example, '1-5,34-36,40' means that only the classifiers with index 1, 2, 3, 4, 5, 34, 35, 36 and 40 will be used.

The pyindx_optimize parameter is an integer with the index of the classifier that will be used to optimize a balanced training dataset. This option is used only if the optimize flag (-o) is set; otherwise it is ignored.
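The index syntax can be expanded with a few lines of Python. This is a sketch of the rule described above (inclusive ranges, comma-separated items); the function name parse_indexes is hypothetical and the module's own parser may differ.

```python
def parse_indexes(spec):
    """Expand a string like '1-5,34-36,40' into a list of integers."""
    indexes = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as '34-36' includes both of its end points.
            start, stop = (int(p) for p in part.split("-"))
            indexes.extend(range(start, stop + 1))
        else:
            indexes.append(int(part))
    return indexes


print(parse_indexes("1-5,34-36,40"))
# [1, 2, 3, 4, 5, 34, 35, 36, 40]
```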

The nan parameter is a string that defines, for each column of the attribute table, which value or function should be used to substitute NaN values. The syntax is, for example: 'col0:9999,col1:9999'. The column name can also be a pattern, so it is possible to define a rule like '*_mean:nanmean,*_max:nanmax', which substitutes NaN with the column mean in all columns ending in '_mean' and with the column maximum in all columns ending in '_max'. This operation is needed because machine-learning algorithms cannot handle nan, inf, neginf, and posinf values.
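The rule syntax can be illustrated with a small, dependency-free sketch. It is not the module's code: the helper names are hypothetical, the pure-Python nanmean and nanmax functions stand in for numpy.nanmean and numpy.nanmax, and taking the first matching rule for each column is an assumption.

```python
import math
from fnmatch import fnmatchcase


def nanmean(values):
    """Mean of the values that are not NaN (stand-in for numpy.nanmean)."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid)


def nanmax(values):
    """Maximum of the values that are not NaN (stand-in for numpy.nanmax)."""
    return max(v for v in values if not math.isnan(v))


FUNCTIONS = {"nanmean": nanmean, "nanmax": nanmax}


def substitute_nan(table, spec):
    """table maps column name -> list of floats; spec is 'pattern:action,...'."""
    rules = [tuple(s.strip() for s in item.split(":")) for item in spec.split(",")]
    result = {}
    for name, values in table.items():
        filled = list(values)
        for pattern, action in rules:
            if fnmatchcase(name, pattern):
                # The action is either a function name or a constant value.
                if action in FUNCTIONS:
                    fill = FUNCTIONS[action](values)
                else:
                    fill = float(action)
                filled = [fill if math.isnan(v) else v for v in values]
                break  # assumption: the first matching rule wins
        result[name] = filled
    return result


nan = float("nan")
table = {"b1_mean": [1.0, nan, 3.0], "b1_max": [nan, 5.0, 2.0], "area": [nan, 1.0]}
cleaned = substitute_nan(table, "*_mean:nanmean,*_max:nanmax,area:9999")
print(cleaned)
```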

The inf parameter is similar to nan, but instead of substituting nan values the rules will be applied for infinite values.

The neginf parameter is similar to nan, but instead of substituting nan values the rules will be applied for negative infinite values.

The posinf parameter is similar to nan, but instead of substituting nan values the rules will be applied for positive infinite values.

The csv_test_cls parameter is the file name/path where the results of the classification test will be written.

The report_class parameter is the file name/path where a summary for each machine-learning algorithm will be written.

The svc_c_range parameter is a range of C values that will be used when exploring the domain of the Support Vector Machine algorithms.

The svc_gamma_range parameter is a range of gamma values that will be used when exploring the domain of the Support Vector Machine algorithms.

The svc_kernel_range parameter is a range of kernel values that will be used when exploring the domain of the Support Vector Machine algorithms.
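The size of the search space defined by the three range parameters can be estimated by enumerating every combination. The sketch below uses the default ranges listed in the synopsis and assumes that the exploration evaluates the full Cartesian product of C, gamma and kernel; the function name svc_grid is hypothetical.

```python
from itertools import product

C_RANGE = [10.0 ** e for e in range(-2, 9)]      # 1e-2 ... 1e8
GAMMA_RANGE = [10.0 ** e for e in range(-6, 5)]  # 1e-6 ... 1e4
KERNELS = ["linear", "poly", "rbf", "sigmoid"]


def svc_grid(c_range, gamma_range, kernels):
    """List every (C, gamma, kernel) combination as a parameter dictionary."""
    return [{"C": c, "gamma": g, "kernel": k}
            for c, g, k in product(c_range, gamma_range, kernels)]


grid = svc_grid(C_RANGE, GAMMA_RANGE, KERNELS)
print(len(grid))  # 11 * 11 * 4 = 484 combinations
```

Under that assumption the default ranges already produce several hundred model fits, which is why narrowing the ranges or raising svc_n_jobs can shorten the exploration considerably.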

The svc_n_jobs parameter is an integer with the number of processes that will be used during the domain exploration of Support Vector Machine algorithms.

The svc_img parameter is the file name/path pattern of the image that will be generated from the domain exploration.

The svc_c parameter is the definitive C value that will be used for final classification.

The svc_gamma parameter is the definitive gamma value that will be used for final classification.

The svc_kernel parameter is the definitive kernel value that will be used for final classification.

The rst_names parameter is the name pattern that will be used to generate the output raster map for each algorithm.

AUTHOR

Pietro Zambelli, University of Trento

Last changed: $Date: 2015-10-13 03:30:35 +0200 (Tue, 13 Oct 2015) $

SOURCE CODE

Available at: v.class.ml source code (history)
