NAME
v.class.ml - Classification of a vector map based on the values in its attribute tables
KEYWORDS
vector,
classification,
machine learning
SYNOPSIS
v.class.ml
v.class.ml --help
v.class.ml [-enfbocrtvxda] vector=name [vtraining=name] [vlayer=string] [tlayer=string] [rlayer=string] [npy_data=string] [npy_cats=string] [npy_cols=string] [npy_index=string] [npy_tdata=string] [npy_tclasses=string] [npy_btdata=string] [npy_btclasses=string] [imp_csv=string] [imp_fig=string] [scalar=string[,string,...]] [decomposition=string] [n_training=integer] [pyclassifiers=string] [pyvar=string] [pyindx=string] [pyindx_optimize=string] [nan=string[,string,...]] [inf=string[,string,...]] [neginf=string[,string,...]] [posinf=string[,string,...]] [csv_test_cls=string] [report_class=string] [svc_c_range=float[,float,...]] [svc_gamma_range=float[,float,...]] [svc_kernel_range=string[,string,...]] [svc_poly_range=string[,string,...]] [svc_n_jobs=integer] [svc_c=float] [svc_gamma=float] [svc_kernel=string] [svc_img=string] [rst_names=string] [--overwrite] [--help] [--verbose] [--quiet] [--ui]
Flags:
- -e
- Extract the training set from the vtraining map
- -n
- Export to numpy files
- -f
- Feature importances using extra trees algorithm
- -b
- Balance the training set using the class with the smallest number of samples
- -o
- Optimize the training samples
- -c
- Classify the whole dataset
- -r
- Export the classification results to raster maps
- -t
- Test different classification methods
- -v
- Add the bias and variance computation to the test
- -x
- Add extra parameters to the test, such as: confusion matrix, ROC, PR
- -d
- Explore the SVC domain
- -a
- Append the classification results
- --overwrite
- Allow output files to overwrite existing files
- --help
- Print usage summary
- --verbose
- Verbose module output
- --quiet
- Quiet module output
- --ui
- Force launching GUI dialog
Parameters:
- vector=name [required]
- Name of vector map
- Name of input vector map
- vtraining=name
- Name of vector map
- Name of training vector map
- vlayer=string
- layer name or number to use for data
- tlayer=string
- layer number/name for the training layer
- rlayer=string
- layer number/name for the ML results
- npy_data=string
- Data with statistics in npy format.
- Default: data.npy
- npy_cats=string
- Numpy array with vector cats.
- Default: cats.npy
- npy_cols=string
- Numpy array with columns names.
- Default: cols.npy
- npy_index=string
- Boolean numpy array with training indexes.
- Default: indx.npy
- npy_tdata=string
- npy file with the training set
- Default: training_data.npy
- npy_tclasses=string
- npy file with the training classes
- Default: training_classes.npy
- npy_btdata=string
- npy file with the balanced training set
- Default: Xbt.npy
- npy_btclasses=string
- npy file with the balanced training classes
- Default: Ybt.npy
- imp_csv=string
- CSV file name with the feature importances rank using extra tree algorithms
- Default: features_importances.csv
- imp_fig=string
- Figure file name with feature importances rank using extra tree algorithms
- Default: features_importances.png
- scalar=string[,string,...]
- Scaler method; with_mean centers the data before scaling; if no, the data are not scaled at all
- Default: with_mean,with_std
- decomposition=string
- choose a decomposition method (PCA, KernelPCA, ProbabilisticPCA, RandomizedPCA, FastICA, TruncatedSVD) and set the parameters using the | to separate the decomposition method from the parameters like: PCA|n_components=98
- Default:
- n_training=integer
- Number of random training samples per class used to train the machine-learning algorithms
- pyclassifiers=string
- a python file with classifiers
- pyvar=string
- Name of the Python variable, which must be a list of dictionaries
- pyindx=string
- specify the index or range of index of the classifiers that you want to use
- pyindx_optimize=string
- Index of the classifiers to optimize the training set
- nan=string[,string,...]
- Column pattern:Value or Numpy function to use to substitute NaN values
- Default: *_skewness:nanmean,*_kurtosis:nanmean
- inf=string[,string,...]
- Key:Value or Numpy function to use to substitute Inf values
- Default: *_skewness:nanmean,*_kurtosis:nanmean
- neginf=string[,string,...]
- Key:Value or Numpy function to use to substitute neginf values
- Default:
- posinf=string[,string,...]
- Key:Value or Numpy function to use to substitute posinf values
- Default:
- csv_test_cls=string
- csv file name with results of different machine learning scores
- Default: test_classifiers.csv
- report_class=string
- text file name with the report of different machine learning algorithms
- Default: classification_report.txt
- svc_c_range=float[,float,...]
- C value range list to explore SVC domain
- Default: 1e-2,1e-1,1e0,1e1,1e2,1e3,1e4,1e5,1e6,1e7,1e8
- svc_gamma_range=float[,float,...]
- gamma value range list to explore SVC domain
- Default: 1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0,1e1,1e2,1e3,1e4
- svc_kernel_range=string[,string,...]
- kernel value range list to explore SVC domain
- Default: linear,poly,rbf,sigmoid
- svc_poly_range=string[,string,...]
- polynomial order list to explore SVC domain
- Default:
- svc_n_jobs=integer
- number of jobs to use during the domain exploration
- Default: 1
- svc_c=float
- definitive C value
- svc_gamma=float
- definitive gamma value
- svc_kernel=string
- definitive kernel value. Available kernels are: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’
- Default: rbf
- svc_img=string
- filename pattern with the image of SVC parameter
- Default: domain_%s.svg
- rst_names=string
- filename pattern for raster
- Default: %s
DESCRIPTION
v.class.ml uses machine-learning algorithms to classify a vector map
based on the values of its attribute table.
The module relies on machine-learning libraries available for Python.
At the moment it uses
scikit-learn (the package name may be
"python-scikit-learn") and
MLPY, but it should be easy to
add other Python libraries.
The module is designed to be used in a modular way: the flags
define which independent tasks should be executed.
- -e
- Extract the training set from a vector map (vtraining).
- -n
- Store the attribute table data, column names, categories, training data,
and training index in numpy binary files.
- -f
- Rank feature importances using an
ExtraTreesClassifier algorithm.
- -b
- Balance the training set using the class with the smallest number of training
samples or the number set in n_training.
- -o
- Optimize a balanced training dataset using the class with the smallest
number of training samples or the number set in n_training.
- -c
- Classify the whole dataset.
- -r
- Export machine-learning results to raster maps.
- -t
- Test several machine-learning algorithms on your dataset.
- -v
- Also test the bias and variance.
- -x
- Also compute extra parameters to evaluate the different algorithms, such as:
confusion matrix, ROC, PR.
- -d
- Explore the Support Vector Classification (SVC) domain.
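The balancing described for the -b flag can be pictured as drawing the same number of samples from every class. The following is a minimal sketch, not the module's own implementation: the function name balance, the random sampling strategy, and the capping of n_training at the size of the smallest class are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance(samples, classes, n_training=None, seed=0):
    """Return an equal number of randomly chosen samples for every class.

    Hypothetical helper: if n_training is None, the size of the smallest
    class is used; a larger n_training is capped at that size (assumption).
    """
    by_class = defaultdict(list)
    for sample, cls in zip(samples, classes):
        by_class[cls].append(sample)

    smallest = min(len(items) for items in by_class.values())
    n = smallest if n_training is None else min(n_training, smallest)

    rng = random.Random(seed)
    balanced_samples, balanced_classes = [], []
    for cls in sorted(by_class):
        for sample in rng.sample(by_class[cls], n):
            balanced_samples.append(sample)
            balanced_classes.append(cls)
    return balanced_samples, balanced_classes


data = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = [1, 1, 1, 1, 2, 2]
x_bal, y_bal = balance(data, labels)
print(y_bal)  # two samples of class 1 followed by two of class 2
```

Fixing the seed keeps the selection reproducible between runs, which is convenient when comparing classifiers on the same balanced subset.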
The vector parameter is the input vector map. The input vector map
must be prepared with v.category to copy the
categories to all the layers that will be created.
The vtraining parameter is a vector input map that can be used to
select the training areas. Currently only supervised classification is
implemented so this parameter is mandatory. The training vector map can
be generated using the GRASS standard tool for supervised classification
g.gui.iclass.
The vlayer parameter is the layer name or number of the
attribute table with the data that must be used as input for the
machine-learning algorithms.
The tlayer parameter is the layer name or number of the
attribute table where the training data for the machine-learning
algorithms are, or will be, stored.
The rlayer parameter is the layer name or number of the
attribute table where the machine-learning results will be stored.
The npy_data parameter is the path where
the binary numpy file containing the complete
dataset will be saved.
The npy_cats parameter is the path where
the binary numpy file containing the vector categories will be saved.
The npy_cols parameter is the path where
the binary numpy file containing the column names of the data attribute table
will be saved.
The npy_index parameter is the path where
the binary numpy file containing a boolean array, which marks whether each
category is used for training, will be saved.
The npy_tdata parameter is the path where
the binary numpy file containing the training data array will be saved.
The npy_tclasses parameter is the path where
the binary numpy file containing the training classes will be saved.
The npy_btdata parameter is like npy_tdata, but for a balanced dataset.
The npy_btclasses parameter is like npy_tclasses, but for a balanced
dataset.
The imp_csv parameter is the path where a CSV
file containing the rank of the feature importances should be saved.
The imp_fig parameter is the path where a
figure file containing the rank of the feature importances should be saved.
The scalar parameter is a string with the scaler methods that will be
applied to pre-process the data. Two main methods are available:
with_mean, with_std.
Scaling is a very common task, so the default applies both methods.
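The names with_mean and with_std match the options of scikit-learn's StandardScaler. As a rough illustration of what the two options mean, here is a plain-Python sketch for a single column. It assumes standardization with the population standard deviation; the function name scale_column is hypothetical and this is not the code the module runs.

```python
import statistics


def scale_column(values, with_mean=True, with_std=True):
    """Standardize one column: optionally center it, then divide by its
    standard deviation."""
    mean = statistics.fmean(values) if with_mean else 0.0
    # Population standard deviation; a constant column is left unscaled.
    std = statistics.pstdev(values) if with_std else 1.0
    if std == 0.0:
        std = 1.0
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0]
print(scale_column(column))                  # centered and scaled
print(scale_column(column, with_std=False))  # centered only: [-2.0, 0.0, 2.0]
```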
The decomposition parameter is a string with the decomposition method that
will be applied to pre-process the data. The main decomposition methods
available are:
PCA,
KernelPCA,
ProbabilisticPCA,
RandomizedPCA,
FastICA,
TruncatedSVD.
Each of these methods can take several parameters. Use "|" as the separator
between the decomposition method name and its options, and "," to
separate the options. For example, to decompose using
the KernelPCA method with 10 components and a linear kernel,
the correct string is:
"KernelPCA|n_components=10,kernel=linear"
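To make the syntax concrete, the sketch below splits such a string into a method name and a dictionary of options. The function name parse_decomposition and the rule that integer-looking values are converted to int are assumptions for illustration; the module's own parser may behave differently.

```python
def parse_decomposition(option):
    """Split 'Method|key=value,key=value' into a name and a dict of options."""
    name, _, params = option.partition("|")
    kwargs = {}
    for item in filter(None, params.split(",")):
        key, _, value = item.partition("=")
        value = value.strip()
        # Convert integer-looking values; keep everything else as a string.
        try:
            kwargs[key.strip()] = int(value)
        except ValueError:
            kwargs[key.strip()] = value
    return name.strip(), kwargs


print(parse_decomposition("KernelPCA|n_components=10,kernel=linear"))
# ('KernelPCA', {'n_components': 10, 'kernel': 'linear'})
```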
The n_training parameter is an integer with the number of training samples
that must be used per class. Some machine-learning methods are sensitive to
whether the training dataset is balanced or not. By default all the training
samples are used.
The pyclassifiers parameter is the path to a Python file containing
a list of dictionaries that define the classifier classes and their options.
See an example of the default classifiers
used by the v.class.ml module.
The pyvar parameter is a string with the name of the Python variable defined
in the pyclassifiers file.
The pyindx parameter is a string with the indexes of the classifiers
that will be used. In the string you can define a range using the minus
character, list the indexes using the comma as separator, or combine both
options. For example, '1-5,34-36,40' means that only the classifiers
with index 1, 2, 3, 4, 5, 34, 35, 36 and 40 will be used.
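The index syntax can be expanded with a few lines of Python. This is a minimal sketch under the assumption that ranges are inclusive on both ends, as the example suggests; the function name parse_indexes is hypothetical and is not taken from the module.

```python
def parse_indexes(text):
    """Expand a string such as '1-5,34-36,40' into a list of integers."""
    indexes = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            start, stop = part.split("-")
            # Ranges are inclusive on both ends.
            indexes.extend(range(int(start), int(stop) + 1))
        elif part:
            indexes.append(int(part))
    return indexes


print(parse_indexes("1-5,34-36,40"))
# [1, 2, 3, 4, 5, 34, 35, 36, 40]
```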
The pyindx_optimize parameter is an integer with the index of the classifier
that will be used to optimize a balanced training dataset. This option is used
only if the optimize flag (-o) is set; otherwise it is ignored.
The nan parameter is a string that allows the user to define, for each
column in the attribute table, which value or function should be used to
substitute NaN values. The syntax could be: 'col0:9999,col1:9999'.
The column name can also be a pattern, so it is possible to define a rule like
'*_mean:nanmean,*_max:nanmax', which substitutes the mean value of the column
in all the columns that end with '_mean', and the maximum value of the column
in all the columns that end with '_max'.
This operation is needed because machine-learning algorithms are not able to
handle nan, inf, neginf, and posinf values.
The inf parameter is similar to nan, but instead of substituting nan
values the rules will be applied for infinite values.
The neginf parameter is similar to nan, but instead of substituting nan
values the rules will be applied for negative infinite values.
The posinf parameter is similar to nan, but instead of substituting nan
values the rules will be applied for positive infinite values.
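The substitution rules can be illustrated with a small sketch for the NaN case. It assumes shell-style pattern matching on column names (here via fnmatch) and that the first matching rule wins; both are assumptions, as are the helper names. The nanmean function below is a plain-Python stand-in for numpy.nanmean so that the example has no dependencies.

```python
import fnmatch
import math


def nanmean(values):
    """Plain-Python stand-in for numpy.nanmean: mean of the non-NaN values."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid)


FUNCTIONS = {"nanmean": nanmean}  # hypothetical lookup table of functions


def parse_rules(text):
    """Turn 'pattern:value,pattern:function' into (pattern, rule) pairs."""
    return [tuple(item.split(":", 1)) for item in text.split(",") if item]


def fill_nan(columns, rules):
    """Replace NaN in every column whose name matches a rule pattern."""
    filled = {}
    for name, values in columns.items():
        filled[name] = list(values)
        for pattern, rule in rules:
            if fnmatch.fnmatch(name, pattern):
                # A known function name is evaluated on the column;
                # anything else is read as a constant.
                if rule in FUNCTIONS:
                    new = FUNCTIONS[rule](values)
                else:
                    new = float(rule)
                filled[name] = [new if math.isnan(v) else v for v in values]
                break
    return filled


table = {"b1_mean": [1.0, float("nan"), 3.0], "area": [float("nan"), 5.0]}
print(fill_nan(table, parse_rules("*_mean:nanmean,area:9999")))
# {'b1_mean': [1.0, 2.0, 3.0], 'area': [9999.0, 5.0]}
```

The same matching logic would carry over to the inf, neginf and posinf rules by swapping the math.isnan test for the corresponding check.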
The csv_test_cls parameter is the file name/path where the results of
the classification test will be written.
The report_class parameter is the file name/path where a summary for
each machine-learning algorithm will be written.
The svc_c_range parameter is a list of C values that will be used
when exploring the domain of the Support Vector Machine algorithms.
The svc_gamma_range parameter is a list of gamma values that will
be used when exploring the domain of the Support Vector Machine algorithms.
The svc_kernel_range parameter is a list of kernel values that will
be used when exploring the domain of the Support Vector Machine algorithms.
The svc_n_jobs parameter is an integer with the number of processes
that will be used during the domain exploration of Support Vector Machine
algorithms.
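To get a feeling for the size of the domain explored with the default ranges, the sketch below enumerates the candidate settings. It assumes that every combination of C, gamma and kernel is evaluated, which this page does not state explicitly; the variable names are illustrative only.

```python
from itertools import product

# Default ranges listed in the parameter section above.
c_range = [10.0 ** e for e in range(-2, 9)]      # 1e-2 ... 1e8
gamma_range = [10.0 ** e for e in range(-6, 5)]  # 1e-6 ... 1e4
kernel_range = ["linear", "poly", "rbf", "sigmoid"]

# Assumed here: every combination of C, gamma and kernel is a candidate.
grid = [
    {"C": c, "gamma": gamma, "kernel": kernel}
    for c, gamma, kernel in product(c_range, gamma_range, kernel_range)
]

print(len(grid))  # 11 * 11 * 4 = 484 candidate settings
```

With several hundred candidate settings, raising svc_n_jobs above its default of 1 is one way to shorten the exploration.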
The svc_img parameter is the file name/path pattern of the image that
will be generated from the domain exploration.
The svc_c parameter is the definitive C value that will be used
for the final classification.
The svc_gamma parameter is the definitive gamma value that will be
used for the final classification.
The svc_kernel parameter is the definitive kernel value that will be
used for the final classification.
The rst_names parameter is the name pattern that will be used to
generate the output raster map for each algorithm.
SEE ALSO
v.class.mlpy (a simpler module
for vector classification which uses
mlpy)
AUTHOR
Pietro Zambelli, University of Trento
SOURCE CODE
Available at: v.class.ml source code (history)
© 2003-2020
GRASS Development Team,
GRASS GIS 7.8.3dev Reference Manual