GRASS GIS manual: r.scatterplot

NAME

r.scatterplot - Creates a scatter plot of raster maps
Creates a scatter plot of two or more raster maps as a vector map

KEYWORDS

raster, statistics, diagram, correlation, scatter plot, vector

SYNOPSIS

r.scatterplot

r.scatterplot --help

r.scatterplot [-wfsub] input=name[,name,...] output=name [z_raster=name] [color_raster=name] [xscale=float] [yscale=float] [zscale=float] [position=east,north] [spacing=float] [vector_mask=name] [mask_layer=string] [mask_cats=range] [mask_where=sql_query] [--overwrite] [--help] [--verbose] [--quiet] [--ui]

Flags:

-w: Place into the current region south-west corner; The output coordinates will not represent the original values
-f: Automatically offset each scatter plot; The output coordinates will not represent the original values
-s: Put points into a single layer; Even with multiple rasters, put all points into a single layer
-u: Invert mask
-b: Do not build topology; Advantageous when handling a large number of points
--overwrite: Allow output files to overwrite existing files
--help: Print usage summary
--verbose: Verbose module output
--quiet: Quiet module output
--ui: Force launching GUI dialog

Parameters:

input=name[,name,...] [required]: Name of input raster map(s)
output=name [required]: Name for output vector map
z_raster=name: Name of input raster map to define Z coordinates
color_raster=name: Name of input raster map to define category and color
xscale=float: Scale to apply to X axis; Default: 1.0
yscale=float: Scale to apply to Y axis; Default: 1.0
zscale=float: Scale to apply to Z axis; Default: 1.0
position=east,north: Place to the given coordinates; The output coordinates will not represent the original values
spacing=float: Spacing between scatter plots; Applied when automatic offset is used
vector_mask=name: Areas to use in the scatter plots; Name of vector map with areas from where the scatter plot should be generated
mask_layer=string: Layer number or name for vector mask; Vector features can have category values in different layers. This number determines which layer to use. When used with direct OGR access this is the layer name; Default: 1
mask_cats=range: Category values for vector mask; Example: 1,3,7-9,13
mask_where=sql_query: WHERE conditions for the vector mask; Example: income < 1000 and population >= 10000

DESCRIPTION

The r.scatterplot module takes raster maps and creates a scatter plot in the form of a vector map, where the individual points of the scatter plot are vector points. As with any scatter plot, the X coordinates of the points represent values from the first raster map and the Y coordinates represent values from the second raster map. Consequently, the vector map is placed in the combined value space of the original raster maps and its geographic position should be ignored. Typically, it is necessary to zoom or to change the computational region in order to view the scatter plot or to perform further computations on the result.

With the default settings, the r.scatterplot output allows measuring and querying of the values in the scatter plot. Options such as xscale or position change the point coordinates, so measurements taken on the output no longer correspond to the original values.

Multiple variables

If more than two raster maps are provided to the input option, r.scatterplot creates a scatter plot for each unique pair of input maps. For example, if A, B, C, and D are the inputs, r.scatterplot creates scatter plots for A and B, A and C, A and D, B and C, B and D, and finally C and D. Each pair is stored in a different layer of the output vector map. r.scatterplot provides textual output which specifies the pairs and associated layers.
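
The pairing described above can be sketched in a few lines of Python. This is only an illustration of which pairs are formed; the sequential layer numbers starting at 1 are an assumption, and the textual output of r.scatterplot remains the authoritative source for the pair-to-layer mapping.

```python
from itertools import combinations


def unique_pairs(names):
    """Return every unordered pair of input names, keeping the input order."""
    return list(combinations(names, 2))


pairs = unique_pairs(["A", "B", "C", "D"])

# Assumption: layers are numbered sequentially from 1 (check the module output).
for layer, (first, second) in enumerate(pairs, start=1):
    print(f"layer {layer}: {first} and {second}")
```

The number of scatter plots grows quadratically with the number of inputs: n maps give n(n-1)/2 pairs.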

A 3D scatter plot can be generated when the z_raster option is provided. A third variable is added to each scatter plot and each point has a Z coordinate which represents this third variable.

Each point can also have a color representing an additional variable taken from color_raster. The raster values are stored as categories, i.e. floating point values are truncated to integers, and a color table based on the input raster color table is assigned to the vector map.

The z_raster and color_raster can be the same. This can help with understanding the 3D scatter plot and makes the third variable visible in 2D as well. When z_raster and color_raster are different, a total of four variables is associated with each point.

Figure: One scatter plot of two variables (left), the same scatter plot but with color showing third variable (middle), again the same scatter plot in 3D where Z represents a third variable (right).

Figure: One scatter plot in 3D with one variable as Z coordinate and another variable as color (two rotated views).

Layout

When working with only two variables, the X axis represents the first one and the Y axis the second one. With more than two variables, the individual scatter plots for individual pairs of variables are at the same place. In this case, the coordinates show the actual values of the variables. Each scatter plot is placed into a separate layer of the output vector map.

Figure: Three overlapping scatter plots of three variables A, B, and C. Individual scatter plots are distinguished by color. The colors can be obtained using d.vect layer=-1 -c.

If visualization is more important than preserving the actual values, the -f flag can be used. This will place the scatter plots next to each other, separated by the value provided using the spacing option.

The layout options can still be combined with additional variables represented as Z coordinate or color. In that case, the variable used for the Z coordinate or color is the same for all the scatter plots.

Figure: Three scatter plots of three variables A, B, and C. First one is A and B, second A and C, and third B and C.

Figure: Three scatter plots of three variables A, B, and C with color showing a fourth variable D in all scatter plots.

The options xscale, yscale and zscale cause the values to be rescaled before they are stored as point coordinates. This is useful for visualization when one of the variables has a significantly different range than the others or when the scatter plot is shown with other data and must fit a certain area. The position option is used to place the scatter plot at any given coordinates. Similarly, the -w flag can be used to place it in the south-west corner of the computational region.
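
As a rough illustration of why rescaled output no longer holds the original values, the following Python sketch assumes that scaling is a plain multiplication of each value by the corresponding factor. The helper names are hypothetical and the module's internals may differ.

```python
def to_plot_coords(x_value, y_value, xscale=1.0, yscale=1.0):
    """Scale a pair of raster values into plot coordinates.

    Assumption: scaling is a plain multiplication by the scale factor.
    """
    return x_value * xscale, y_value * yscale


def to_raster_values(x_coord, y_coord, xscale=1.0, yscale=1.0):
    """Invert the assumed scaling to recover the original raster values."""
    return x_coord / xscale, y_coord / yscale


# Hypothetical example: X values span 0-255 while Y values span 0-1,
# so Y is stretched by 255 to give both axes a similar extent.
coords = to_plot_coords(128, 0.5, xscale=1.0, yscale=255.0)
print(coords)
print(to_raster_values(*coords, xscale=1.0, yscale=255.0))
```

Under this assumption, a coordinate queried in the scaled plot has to be divided by the scale factor before it can be interpreted as a raster value.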

NOTES

The resulting vector will have as many points as there are raster cells in the current computational region. It might be appropriate to use a coarser resolution for the scatter plot than for the other computations. However, note that some values will be skipped, which may lead, e.g., to missing some outliers.
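
To get a feeling for how the resolution affects the number of points, the cell count can be estimated from the region extent. The following Python sketch uses a hypothetical 12 km by 12 km extent and assumes that the extent is rounded up to whole cells; the function name is illustrative only.

```python
import math


def cell_count(north, south, east, west, res):
    """Estimate the number of cells in a region with square cells of size res."""
    # Assumption: partial cells at the border are rounded up to whole cells.
    rows = math.ceil((north - south) / res)
    cols = math.ceil((east - west) / res)
    return rows * cols


# Hypothetical extent in meters.
extent = dict(north=12000, south=0, east=12000, west=0)
for res in (30, 120, 240):
    print(res, cell_count(res=res, **extent))
```

Making the cells four times larger reduces the number of points roughly sixteen times.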

The color_raster input is expected to be a categorical raster or to have values which won't lose anything when converted from floating point to integer. This is because vector categories are used to store the color_raster values and carry the association with the color.
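
The effect of the conversion can be checked on a handful of sample values before running the module. This Python sketch uses the built-in int(), which truncates toward zero, as a stand-in for the conversion; the function name and the sample values are hypothetical.

```python
def lossy_values(values):
    """Return the values that change when truncated to an integer category."""
    return [v for v in values if int(v) != v]


# Values of a categorical raster stored as floating point: nothing is lost.
print(lossy_values([1.0, 2.0, 7.0]))

# Continuous values: 2.5 and 7.9 would become categories 2 and 7.
print(lossy_values([1.0, 2.5, 7.9]))
```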

The visualization of the output vector map potentially has the same issue as the visualization of any vector map with many points. The points cover each other and, above a certain density of points, it is not possible to compare relative density in the scatter plot. Furthermore, if colors are associated with the points, the colors of the points rendered last are the ones which are visible, so they do not necessarily show the prevailing color (value). The modules v.mkgrid and v.vect.stats can be used to overcome this issue.

EXAMPLES

Landsat bands

In the full North Carolina sample location, set the computation region to one of the raster maps:

g.region raster=lsat7_2002_30

Create the scatter plot:

r.scatterplot input=lsat7_2002_30,lsat7_2002_40 output=scatterplot color_raster=landclass96

Figure: Scatter plot showing red and near infrared Landsat bands colored using land cover classes

High density scatter plots

In an ideal case, the scatter plot is computed with the computational region resolution set to the resolution of one of the rasters (which ideally matches the other raster as well):

g.region raster=lsat7_2002_30 -p
r.scatterplot input=lsat7_2002_30,lsat7_2002_40 output=scatterplot_full_res

This best describes the actual state of the data, but unfortunately this creates a lot of points which must be processed and rendered. Therefore, it is also possible to compute the scatter plot in a lower resolution by changing the computational region resolution:

g.region raster=lsat7_2002_30 res=120 -p
r.scatterplot input=lsat7_2002_30,lsat7_2002_40 output=scatterplot_res_120

Reducing the resolution creates a possibility of missing some outliers or even smaller groups, as some of the cells are simply ignored, but typically the general shape of the scatter plot is preserved. In any case, with fewer points, every operation will be much faster.

Figure: Scatter plots computed with different computational region resolutions; the first one is with full raster resolution (~30 m), the second with resolution 120 m, and the third with 240 m

Another way of dealing with high density scatter plots is to bin the points into cells of a rectangular grid. The number of points per cell will influence the color of the cell, so the density will be expressed clearly. The scatter plot can be computed in full resolution:

g.region raster=lsat7_2002_30 -p
r.scatterplot input=lsat7_2002_30,lsat7_2002_40 output=scatterplot

To create the grid, the computational region extent should match the scatter plot extent. The resolution determines the size of the grid cells. A cell size of 5 works well for data ranging from 0 to 255.

g.region vector=scatterplot res=5 -p

The grid can be created using v.mkgrid module, the binning done using v.vect.stats, and finally the color is set using v.colors.

v.mkgrid map=scatterplot_grid
v.vect.stats points=scatterplot areas=scatterplot_grid count_column=count
v.colors map=scatterplot_grid use=attr column=count color=viridis
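
Conceptually, the binning step counts how many scatter plot points fall into each grid cell. The following Python sketch shows that idea on a few hypothetical points; it is not how v.vect.stats is implemented, and the function name and cell indexing from the south-west corner are assumptions made for the illustration.

```python
from collections import Counter


def bin_points(points, west, south, cell_size):
    """Count points per grid cell, indexed by (column, row) from the south-west corner."""
    counts = Counter()
    for x, y in points:
        col = int((x - west) // cell_size)
        row = int((y - south) // cell_size)
        counts[(col, row)] += 1
    return counts


# Hypothetical scatter plot points (values of two bands) and a grid with cell size 5.
points = [(61, 40), (62, 41), (63, 44), (70, 52), (120, 200)]
counts = bin_points(points, west=0, south=0, cell_size=5)
for cell, count in sorted(counts.items()):
    print(cell, count)
```

Cells which received no points do not appear in the result, which corresponds to filtering the grid with count > 0 when displaying it.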

The d.vect module picks up the color table automatically, but it is advantageous to also specify that only the grid cells with non-zero count of points should be displayed using where="count > 0":

d.vect map=scatterplot_grid where="count > 0" icon=basic/point

To get a more interesting and sometimes smoother look, a hexagonal grid can be used:

v.mkgrid -h map=scatterplot_grid

Alternatively, a smaller cell size can be used. When the cell size is the same as the distance between the points, for example when using cell size 1 with integer rasters, the grid needs to be shifted so that the points fall into the middle of the cells rather than on the edges or corners. For this purpose, g.region accepts modifications of the current extent values:

g.region vector=scatterplot res=1 w=w-0.5 e=e+0.5 s=s-0.5 n=n+0.5
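
The reason for the half-cell shift can be seen by computing where an integer value lands within its cell. In this Python sketch, the function name and the western edge values are hypothetical; an offset of 0 means the point lies on a cell edge and 0.5 means it lies in the cell center.

```python
def cell_position(value, west, cell_size=1.0):
    """Return the cell index and the relative offset (0 to 1) of a value within its cell."""
    relative = (value - west) / cell_size
    index = int(relative // 1)
    return index, relative - index


# Grid edges aligned with the integer values: the point sits on a cell edge.
print(cell_position(5, west=0))

# Grid shifted by half a cell: the point sits in the middle of the cell.
print(cell_position(5, west=-0.5))
```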

Figure: High density scatter plot visualized using binning into rectangular grid, hexagonal grid, and dense rectangular grid

AUTHOR

Vaclav Petras, NCSU GeoForAll Lab

SOURCE CODE

Available at: r.scatterplot source code (history)