sgsPy
structurally guided sampling

Classes

struct  sgs::pca::PCAResult< T >

Functions

template<typename T>
PCAResult< T > sgs::pca::calculatePCA (std::vector< helper::RasterBandMetaData > &bands, GDALDataType type, size_t size, int width, int height, int nComp)
template<typename T>
PCAResult< T > sgs::pca::calculatePCA (std::vector< helper::RasterBandMetaData > &bands, GDALDataType type, size_t size, int xBlockSize, int yBlockSize, int xBlocks, int yBlocks, int nComp)
template<typename T>
void sgs::pca::writePCA (std::vector< helper::RasterBandMetaData > &bands, std::vector< helper::RasterBandMetaData > &PCABands, PCAResult< T > &result, GDALDataType type, size_t size, int height, int width)
template<typename T>
void sgs::pca::writePCA (std::vector< helper::RasterBandMetaData > &bands, std::vector< helper::RasterBandMetaData > &PCABands, PCAResult< T > &result, GDALDataType type, size_t size, int xBlockSize, int yBlockSize, int xBlocks, int yBlocks)
std::tuple< raster::GDALRasterWrapper *, std::vector< std::vector< double > >, std::vector< double >, std::vector< double >, std::vector< double > > sgs::pca::pca (raster::GDALRasterWrapper *p_raster, int nComp, bool largeRaster, std::string tempFolder, std::string filename, std::map< std::string, std::string > driverOptions)

Detailed Description

Function Documentation

◆ calculatePCA() [1/2]

template<typename T>
PCAResult< T > sgs::pca::calculatePCA ( std::vector< helper::RasterBandMetaData > & bands,
GDALDataType type,
size_t size,
int width,
int height,
int nComp )

This function is used by the pca() function to calculate the principal component eigenvectors and eigenvalues, along with the mean and standard deviation of each input raster band. This function is used in the case where the input raster is small, and can reasonably be expected to fit entirely into memory.

First, the input raster bands are read into memory using the GDALRasterBand RasterIO function. Bands are read in a row-wise manner, such that each row holds a single pixel and each column holds a raster band. This means that between each pixel and the next, a gap must be left for the remaining band values of that pixel to be written to. This is done using the nPixelSpace and nLineSpace arguments of RasterIO.
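The pixel-interleaved layout described above can be illustrated without GDAL. The following is a minimal sketch, not the sgs implementation: copyBandInterleaved is a hypothetical stand-in for a strided RasterIO read, and it assumes each band is already available as a contiguous row-major array. In a real RasterIO call the same effect would come from offsetting the buffer pointer by the band index and passing nPixelSpace = nBands * sizeof(T) and nLineSpace = width * nBands * sizeof(T).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a strided RasterIO read: copies one band
// (contiguous, row-major) into column `band` of a pixel-major buffer,
// where each buffer row holds all band values of a single pixel.
template <typename T>
void copyBandInterleaved(const std::vector<T>& bandData, std::vector<T>& buffer,
                         std::size_t band, std::size_t nBands,
                         std::size_t width, std::size_t height) {
    assert(band < nBands);
    assert(bandData.size() == width * height);
    assert(buffer.size() == width * height * nBands);

    // Strides in elements (RasterIO expresses the same strides in bytes).
    const std::size_t pixelSpace = nBands;
    const std::size_t lineSpace = width * nBands;

    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            buffer[band + y * lineSpace + x * pixelSpace] = bandData[y * width + x];
        }
    }
}
```

Calling this once per band fills the gaps left by the other bands, so pixel i ends up occupying elements [i * nBands, (i + 1) * nBands) of the buffer.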

Second, each pixel is checked to ensure it isn't a NaN pixel. Any pixel containing a NaN value in any band is overwritten completely with the next non-NaN pixel. The total number of non-NaN pixels is stored as the number of features.
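The compaction step above can be sketched as an in-place filter over the pixel-major buffer. This is an illustrative sketch under assumptions: compactValidPixels is a hypothetical name, and nodata values are assumed to have been converted to NaN already.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Compacts a pixel-major buffer (nBands values per pixel) in place so that
// only pixels with no NaN in any band remain at the front of the buffer.
// Returns the number of valid pixels kept; elements past that are stale.
template <typename T>
std::size_t compactValidPixels(std::vector<T>& buffer, std::size_t nBands) {
    assert(nBands > 0 && buffer.size() % nBands == 0);
    const std::size_t nPixels = buffer.size() / nBands;
    std::size_t kept = 0;

    for (std::size_t p = 0; p < nPixels; ++p) {
        bool valid = true;
        for (std::size_t b = 0; b < nBands; ++b) {
            if (std::isnan(buffer[p * nBands + b])) {
                valid = false;
                break;
            }
        }
        if (!valid) continue;

        // Move the valid pixel down over any invalid pixels skipped so far.
        if (kept != p) {
            for (std::size_t b = 0; b < nBands; ++b) {
                buffer[kept * nBands + b] = buffer[p * nBands + b];
            }
        }
        ++kept;
    }
    return kept;
}
```

Only the first kept * nBands elements are meaningful afterwards, which is why the count has to be stored alongside the buffer.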

The mean and standard deviation are then calculated using Welford's method, and the PCA eigenvectors and eigenvalues are calculated using the principal components functionality of the oneDAL library.
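For reference, Welford's method updates a running mean and a running sum of squared deviations one value at a time. The sketch below applies it per band to a pixel-major buffer. BandStats and bandStatistics are hypothetical names, and dividing by n (population standard deviation) is an assumption, since the source does not say whether n or n - 1 is used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running statistics for one band, updated with Welford's method.
struct BandStats {
    std::size_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;  // running sum of squared deviations from the mean

    void update(double x) {
        ++count;
        const double delta = x - mean;
        mean += delta / static_cast<double>(count);
        m2 += delta * (x - mean);  // uses the already-updated mean
    }

    // Assumption: population standard deviation (divide by n).
    double stdev() const {
        return count > 0 ? std::sqrt(m2 / static_cast<double>(count)) : 0.0;
    }
};

// Computes per-band statistics over a pixel-major buffer of valid pixels.
std::vector<BandStats> bandStatistics(const std::vector<double>& buffer,
                                      std::size_t nBands) {
    assert(nBands > 0 && buffer.size() % nBands == 0);
    std::vector<BandStats> stats(nBands);
    const std::size_t nPixels = buffer.size() / nBands;
    for (std::size_t p = 0; p < nPixels; ++p) {
        for (std::size_t b = 0; b < nBands; ++b) {
            stats[b].update(buffer[p * nBands + b]);
        }
    }
    return stats;
}
```

Because the state is only a count, a mean, and a sum, the same BandStats objects can keep receiving values from later blocks, which is what makes the method suitable for block-wise processing.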

A result containing the eigenvectors, eigenvalues, mean per band, and standard deviation per band, is returned.

Parameters
std::vector<RasterBandMetaData>& bands
GDALDataType type
size_t size
int width
int height
int nComp
Returns
PCAResult<T>

◆ calculatePCA() [2/2]

template<typename T>
PCAResult< T > sgs::pca::calculatePCA ( std::vector< helper::RasterBandMetaData > & bands,
GDALDataType type,
size_t size,
int xBlockSize,
int yBlockSize,
int xBlocks,
int yBlocks,
int nComp )

This function is used by the pca() function to calculate the principal component eigenvectors and eigenvalues, along with the mean and standard deviation of each input raster band. This function is used in the case where the input raster is large and will be processed in blocks.

All of the blocks are iterated through, and within each iteration the following is done:

First, the input raster band blocks are read into memory using the GDALRasterBand RasterIO function. Bands are read in a row-wise manner, such that each row holds a single pixel and each column holds a raster band. This means that between each pixel and the next, a gap must be left for the remaining band values of that pixel to be written to. This is done using the nPixelSpace and nLineSpace arguments of RasterIO.

Second, each pixel is checked to ensure it isn't a NaN pixel. Any pixel containing a NaN value in any band is overwritten completely with the next non-NaN pixel. The total number of non-NaN pixels is stored as the number of features.

The mean and standard deviation are then updated using Welford's method, and the partial result for the PCA eigenvectors and eigenvalues is updated using the principal components functionality of the oneDAL library.

Once all blocks have been iterated through, the final mean per band, standard deviation per band, eigenvectors, and eigenvalues are calculated and returned.
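The block iteration described above can be sketched as follows. This is a hedged illustration rather than the sgs code: BlockWindow and blockWindows are hypothetical names, the raster width and height are assumed to be known to the caller, and edge blocks are assumed to be clipped to the raster extent. Each window's fields correspond to the nXOff, nYOff, nXSize, and nYSize arguments that a RasterIO read of that block would take.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pixel window covered by one block (hypothetical helper type).
struct BlockWindow {
    int xOff;
    int yOff;
    int xSize;
    int ySize;
};

// Enumerates block windows row by row, clipping blocks on the right and
// bottom edges so they never extend past the raster.
std::vector<BlockWindow> blockWindows(int width, int height,
                                      int xBlockSize, int yBlockSize) {
    assert(width > 0 && height > 0 && xBlockSize > 0 && yBlockSize > 0);
    const int xBlocks = (width + xBlockSize - 1) / xBlockSize;   // ceiling division
    const int yBlocks = (height + yBlockSize - 1) / yBlockSize;

    std::vector<BlockWindow> windows;
    windows.reserve(static_cast<std::size_t>(xBlocks) * static_cast<std::size_t>(yBlocks));
    for (int yBlock = 0; yBlock < yBlocks; ++yBlock) {
        for (int xBlock = 0; xBlock < xBlocks; ++xBlock) {
            const int xOff = xBlock * xBlockSize;
            const int yOff = yBlock * yBlockSize;
            const int xSize = std::min(xBlockSize, width - xOff);
            const int ySize = std::min(yBlockSize, height - yOff);
            windows.push_back({xOff, yOff, xSize, ySize});
        }
    }
    return windows;
}
```

Within each window, the read, NaN-compaction, and running-statistics steps are applied to that block's pixels only, and the accumulated state is finalized after the last window.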

Parameters
std::vector<RasterBandMetaData>& bands
GDALDataType type
size_t size
int xBlockSize
int yBlockSize
int xBlocks
int yBlocks
int nComp
Returns
PCAResult<T>

◆ pca()

std::tuple< raster::GDALRasterWrapper *, std::vector< std::vector< double > >, std::vector< double >, std::vector< double >, std::vector< double > > sgs::pca::pca ( raster::GDALRasterWrapper * p_raster,
int nComp,
bool largeRaster,
std::string tempFolder,
std::string filename,
std::map< std::string, std::string > driverOptions )

This function conducts principal component analysis on the input raster, writing output bands to a new GDALRasterWrapper, and returning the eigenvectors and eigenvalues calculated for each raster band. The output values are both centered and scaled before being projected onto the pca eigenvectors.

First, depending on whether the raster is large (should be processed in blocks) or not, and whether an output filename is given, an output dataset is created to store the output results. In the case of a small raster without a given filename, an in-memory raster is created. In the case of a large raster without a given filename, a VRT dataset is created where each VRT band is a GTiff raster. When a filename is given, the driver which corresponds to that filename is used.

Then, the calculatePCA() function is called, with specific template parameters depending on the data type, and a specific function overload depending on whether the raster should be processed by blocks. This function calculates the principal component eigenvectors, eigenvalues, mean per band, and standard deviation per band. The writePCA() function is then called (again with specific template and overload) to center, scale, and project the input raster values to output pca bands which are written to the output dataset.

Finally, a GDALRasterWrapper is created using the output dataset, and returned in a tuple alongside the eigenvectors and eigenvalues.

Parameters
GDALRasterWrapper* p_raster
int nComp
bool largeRaster
std::string tempFolder
std::string filename
std::map<std::string, std::string> driverOptions
Returns
std::tuple< GDALRasterWrapper *, std::vector<std::vector<double>>, std::vector<double>, std::vector<double>, std::vector<double> >

◆ writePCA() [1/2]

template<typename T>
void sgs::pca::writePCA ( std::vector< helper::RasterBandMetaData > & bands,
std::vector< helper::RasterBandMetaData > & PCABands,
PCAResult< T > & result,
GDALDataType type,
size_t size,
int height,
int width )

This function is used to write the output principal components to a raster dataset, after the eigenvectors and eigenvalues have already been calculated for the input raster. This function is used in the case where the raster is small and can be expected to fit entirely in memory without causing errors.

First, the input raster bands are read into memory using the GDALRasterBand RasterIO function. Bands are read in a row-wise manner, such that each row holds a single pixel and each column holds a raster band. This means that between each pixel and the next, a gap must be left for the remaining band values of that pixel to be written to. This is done using the nPixelSpace and nLineSpace arguments of RasterIO. The data pixels are then iterated over: each is scaled and shifted, or set to NaN if it is a no data pixel.
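The per-pixel pass at the end of this step can be sketched as below. standardizePixels is a hypothetical name. Treating a pixel as no data when any band equals that band's nodata value (or is already NaN), and then setting every band of that pixel to NaN, is an assumption about how the check is applied.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Centers and scales a pixel-major buffer in place: (x - mean) / stdev per band.
// Assumption: a pixel that is nodata (or NaN) in any band becomes NaN in all bands.
template <typename T>
void standardizePixels(std::vector<T>& buffer,
                       const std::vector<T>& mean,
                       const std::vector<T>& stdev,
                       const std::vector<T>& noData) {
    const std::size_t nBands = mean.size();
    assert(nBands > 0 && stdev.size() == nBands && noData.size() == nBands);
    assert(buffer.size() % nBands == 0);

    const T kNaN = std::numeric_limits<T>::quiet_NaN();
    const std::size_t nPixels = buffer.size() / nBands;

    for (std::size_t p = 0; p < nPixels; ++p) {
        T* pixel = buffer.data() + p * nBands;

        bool isNoData = false;
        for (std::size_t b = 0; b < nBands; ++b) {
            if (std::isnan(pixel[b]) || pixel[b] == noData[b]) {
                isNoData = true;
                break;
            }
        }

        for (std::size_t b = 0; b < nBands; ++b) {
            pixel[b] = isNoData ? kNaN : (pixel[b] - mean[b]) / stdev[b];
        }
    }
}
```

Unlike the statistics pass, invalid pixels are kept in place here, since every input pixel needs a corresponding output pixel. A band with zero standard deviation would need separate handling, which this sketch does not attempt.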

Next, a matrix for the PCA eigenvectors is allocated, and the eigenvectors are copied into it.

Both the data matrix and the pca matrix are turned into oneDAL homogen tables, and the result of a linear kernel calculation is written to the output.

A linear kernel is used because the result is essentially just a set of dot products. These dot products could be computed one at a time for each output pixel and component, but the linear kernel, which was originally meant for fast machine learning use, does exactly what we need.
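To make the equivalence concrete, the sketch below computes the same projection with plain loops, one dot product per output pixel and component. A linear kernel with unit scale and zero shift reduces to exactly this product. projectPixels is a hypothetical name, and storing one eigenvector per row of the eigenvector matrix is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// data:         nPixels x nBands, row-major (one standardized pixel per row)
// eigenvectors: nComp x nBands, row-major (assumed: one eigenvector per row)
// returns:      nPixels x nComp, row-major
template <typename T>
std::vector<T> projectPixels(const std::vector<T>& data,
                             const std::vector<T>& eigenvectors,
                             std::size_t nBands, std::size_t nComp) {
    assert(nBands > 0 && data.size() % nBands == 0);
    assert(eigenvectors.size() == nComp * nBands);

    const std::size_t nPixels = data.size() / nBands;
    std::vector<T> out(nPixels * nComp, T(0));

    for (std::size_t p = 0; p < nPixels; ++p) {
        for (std::size_t c = 0; c < nComp; ++c) {
            T dot = T(0);
            for (std::size_t b = 0; b < nBands; ++b) {
                dot += data[p * nBands + b] * eigenvectors[c * nBands + b];
            }
            out[p * nComp + c] = dot;  // component c of output pixel p
        }
    }
    return out;
}
```

A NaN in any band of a pixel propagates to every component of that pixel, so no data pixels stay NaN in the output without extra bookkeeping.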

Parameters
std::vector<RasterBandMetaData>& bands
std::vector<RasterBandMetaData>& PCABands
PCAResult<T>& result
GDALDataType type
size_t size
int height
int width

The result for each output principal component pixel is just the dot product of that pixel's data values with the corresponding principal component eigenvector.

oneDAL has a fast way to calculate dot products. It was originally meant for machine learning, but it does exactly what we need: multiplying large matrices.

◆ writePCA() [2/2]

template<typename T>
void sgs::pca::writePCA ( std::vector< helper::RasterBandMetaData > & bands,
std::vector< helper::RasterBandMetaData > & PCABands,
PCAResult< T > & result,
GDALDataType type,
size_t size,
int xBlockSize,
int yBlockSize,
int xBlocks,
int yBlocks )

This function is used to write the output principal components to a raster dataset, after the eigenvectors and eigenvalues have already been calculated for the input raster. This function is used in the case where the raster is large, and should be processed in blocks.

For each block:

First, the input raster bands are read into memory using the GDALRasterBand RasterIO function. Bands are read in a row-wise manner, such that each row holds a single pixel and each column holds a raster band. This means that between each pixel and the next, a gap must be left for the remaining band values of that pixel to be written to. This is done using the nPixelSpace and nLineSpace arguments of RasterIO. The data pixels are then iterated over: each is scaled and shifted, or set to NaN if it is a no data pixel.

Next, a matrix for the PCA eigenvectors is allocated, and the eigenvectors are copied into it.

Both the data matrix and the pca matrix are turned into oneDAL homogen tables, and the result of a linear kernel calculation is written to the output.

A linear kernel is used because the result is essentially just a set of dot products. These dot products could be computed one at a time for each output pixel and component, but the linear kernel, which was originally meant for fast machine learning use, does exactly what we need.

Parameters
std::vector<RasterBandMetaData>& bands
std::vector<RasterBandMetaData>& PCABands
PCAResult<T>& result
GDALDataType type
size_t size
int xBlockSize
int yBlockSize
int xBlocks
int yBlocks

The result for each output principal component pixel is just the dot product of that pixel's data values with the corresponding principal component eigenvector.

oneDAL has a fast way to calculate dot products. It was originally meant for machine learning (as I understand it), but it does exactly what we need: multiplying large matrices.