sgsPy
structurally guided sampling
Loading...
Searching...
No Matches

Functions

void sgs::quantiles::calcSPQuantiles (raster::GDALRasterWrapper *p_raster, helper::RasterBandMetaData &band, std::vector< double > &probabilities, std::vector< double > &quantiles)
void sgs::quantiles::calcDPQuantiles (raster::GDALRasterWrapper *p_raster, helper::RasterBandMetaData &band, std::vector< double > &probabilities, std::vector< double > &quantiles)
void sgs::quantiles::batchCalcSPQuantiles (raster::GDALRasterWrapper *p_raster, helper::RasterBandMetaData &band, std::vector< double > &probabilities, std::vector< double > &quantiles, std::mutex &mutex, std::condition_variable &cv, bool &calculated, double eps)
void sgs::quantiles::batchCalcDPQuantiles (raster::GDALRasterWrapper *p_raster, helper::RasterBandMetaData &band, std::vector< double > &probabilities, std::vector< double > &quantiles, std::mutex &mutex, std::condition_variable &cv, bool &calculated, double eps)
void sgs::quantiles::processMapPixel (size_t index, helper::RasterBandMetaData &dataBand, void *p_dataBuffer, helper::RasterBandMetaData &stratBand, void *p_stratBuffer, std::vector< double > &quantiles, size_t multiplier, bool &mapNan, size_t &mapStrat)
void sgs::quantiles::processPixel (size_t index, void *p_data, helper::RasterBandMetaData *p_dataBand, void *p_strat, helper::RasterBandMetaData *p_stratBand, std::vector< double > &quantiles)
std::pair< raster::GDALRasterWrapper *, std::unordered_map< std::string, std::vector< double > > > sgs::quantiles::quantiles (raster::GDALRasterWrapper *p_raster, std::map< int, std::vector< double > > userProbabilites, bool map, std::string filename, std::string tempFolder, bool largeRaster, int threadCount, std::map< std::string, std::string > driverOptions, double eps)

Detailed Description

Function Documentation

◆ batchCalcDPQuantiles()

void sgs::quantiles::batchCalcDPQuantiles ( raster::GDALRasterWrapper * p_raster,
helper::RasterBandMetaData & band,
std::vector< double > & probabilities,
std::vector< double > & quantiles,
std::mutex & mutex,
std::condition_variable & cv,
bool & calculated,
double eps )

This helper function is used to calculate the quantiles of a large raster which is more efficient to calculate in batches rather than trying to allocate into memory. This is the double precision version of this function.

Intel's MKL (Math Kernel Library) package is used to calculate the quantiles. The MKL package does not account for nan values, so the data must be filtered first, leaving only the data pixel values left.

the following MKL VSL (Vector Statistics Library) functions are used to calculate the quantiles: vsldSSCreateTask() vsldSSEditStreamQuantiles() vsldSSCompute() vslSSDeleteTask()

CreateTask is called at the start of execution, and VSLsSSEditStreamQuantiles() is called after the first set of values has been written into the buffer. It was found that calling this function before resulted in incorrect values, although the funciton still only has to be called once. The Compute function is called many times, to continuously update the quantiles across batches. The fast calculation method is used, which means afterwards the Compute funciton must be called again with the normal method and an observable count of 0 to get the final quantile values. Quantile streaming algorithms are not exact, since the entire raster must be in memory to get the most precise possible values. However, the amount of error can be controlled to a specific epsilon (eps) value.

The Quantile streaming method is the method introduced by Zhang et al. and utilized by MKL: https://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-summary-statistics-notes/2021-1/computing-quantiles-with-vsl-ss-method-squants-zw.html

Due to the fact that large raster processing in batches is multi-threaded in SGS, a condition variable (cv) is used to ensure a thread which would be using the calculated quantiles doesn't begin processing or writing any data before the quantiles are finalized. The shared resource is a boolean representing the band which is set to true by this thread once the quantiles are calculated. Any number of processing threads may be waiting on the quantiles of this band to be calculated, so notify_all is called on the cv to wake those threads up.

Parameters
GDALRasterWrapper*p_raster
RasterBandMetaData&band
std::vector<double>&probabilities
std::vector<double>&quantiles
std::mutex&mutex
std::condition_variable&cdv
bool&calculated
doubleeps

◆ batchCalcSPQuantiles()

void sgs::quantiles::batchCalcSPQuantiles ( raster::GDALRasterWrapper * p_raster,
helper::RasterBandMetaData & band,
std::vector< double > & probabilities,
std::vector< double > & quantiles,
std::mutex & mutex,
std::condition_variable & cv,
bool & calculated,
double eps )

This helper function is used to calculate the quantiles of a large raster which is more efficient to calculate in batches rather than trying to allocate into memory. This is the single precision version of this function, which is used for all raster data types except double precision floating point values.

Both the quantile probabilities, and the quantiles themselves are double precision, however, so they must be converted before and after execution to and from single precision floating point.

Intel's MKL (Math Kernel Library) package is used to calculate the quantiles. The MKL package does not account for nan values, so the data must be filtered first, leaving only the data pixel values left.

the following MKL VSL (Vector Statistics Library) functions are used to calculate the quantiles: vslsSSCreateTask() vslsSSEditStreamQuantiles() vslsSSCompute() vslSSDeleteTask()

CreateTask is called at the start of execution, and VSLsSSEditStreamQuantiles() is called after the first set of values has been written into the buffer. It was found that calling this function before resulted in incorrect values, although the funciton still only has to be called once. The Compute function is called many times, to continuously update the quantiles across batches. The fast calculation method is used, which means afterwards the Compute funciton must be called again with the normal method and an observable count of 0 to get the final quantile values. Quantile streaming algorithms are not exact, since the entire raster must be in memory to get the most precise possible values. However, the amount of error can be controlled to a specific epsilon (eps) value.

The Quantile streaming method is the method introduced by Zhang et al. and utilized by MKL: https://web.cs.ucla.edu/~weiwang/paper/SSDBM07_2.pdf https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-summary-statistics-notes/2021-1/computing-quantiles-with-vsl-ss-method-squants-zw.html

Due to the fact that large raster processing in batches is multi-threaded in SGS, a condition variable (cv) is used to ensure a thread which would be using the calculated quantiles doesn't begin processing or writing any data before the quantiles are finalized. The shared resource is a boolean representing the band which is set to true by this thread once the quantiles are calculated. Any number of processing threads may be waiting on the quantiles of this band to be calculated, so notify_all is called on the cv to wake those threads up.

Parameters
GDALRasterWrapper*p_raster
RasterBandMetaData&band
std::vector<double>&probabilities
std::vector<double>&quantiles
std::mutex&mutex
std::condition_variable&cdv
bool&calculated
doubleeps

◆ calcDPQuantiles()

void sgs::quantiles::calcDPQuantiles ( raster::GDALRasterWrapper * p_raster,
helper::RasterBandMetaData & band,
std::vector< double > & probabilities,
std::vector< double > & quantiles )

This helper function is used to calculate the quantiles of a raster which is entirely in memory. This is the double precision version of this function.

Intel's MKL (Math Kernel Library) package is used to calculate the quantiles. The MKL package does not account for nan values, so the data must be filtered first, leaving only the data pixel values left.

the following MKL VSL (Vector Statistics Library) functions are used to calculate the quantiles: vsldSSCreateTask() vsldSSEditQuantiles() vsldSSCompute() vslSSDeleteTask()

Parameters
GDALRasterWrapper*p_raster
RasterBandMetaData&band
std::vector<double>&probabilities
std::vector<double>&quantiles

◆ calcSPQuantiles()

void sgs::quantiles::calcSPQuantiles ( raster::GDALRasterWrapper * p_raster,
helper::RasterBandMetaData & band,
std::vector< double > & probabilities,
std::vector< double > & quantiles )

This helper function is used to calculate the quantiles of a raster which is entirely in memory. This is the single precision version of this function, which is used for all raster data types except double precision floating point values.

Both the quantile probabilities, and the quantiles themselves are double precision, however, so they must be converted before and after execution to and from single precision floating point.

Intel's MKL (Math Kernel Library) package is used to calculate the quantiles. The MKL package does not account for nan values, so the data must be filtered first, leaving only the data pixel values left.

the following MKL VSL (Vector Statistics Library) functions are used to calculate the quantiles: vslsSSCreateTask() vslsSSEditQuantiles() vslsSSCompute() vslSSDeleteTask()

Parameters
GDALRasterWrapper*p_raster
RasterBandMetaData&band
std::vector<double>&probabilities
std::vector<double>&quantiles

◆ processMapPixel()

void sgs::quantiles::processMapPixel ( size_t index,
helper::RasterBandMetaData & dataBand,
void * p_dataBuffer,
helper::RasterBandMetaData & stratBand,
void * p_stratBuffer,
std::vector< double > & quantiles,
size_t multiplier,
bool & mapNan,
size_t & mapStrat )
inline

This is a helper function for processing a pixel of data when a mapped stratification is being created.

First, the value is read in as a double, and it is determined whether the pixel is a nan pixel or not. The mapNan boolean is updated in addition to the isNan boolean, to ensure that if one band within the raster is nan at a certain pixel then the mapped raster (but not necessarily all output rasters) is also nan at that pixel.

Then, if it isn't a nan pixel the lower bound of the value within the vector of break values is found. For example, if the value was 3 and the breaks vector was [2, 4, 6], the lower bound would be 1, which is the index of 4, the first value larger than 3 in the breaks vector. This lower bound is the strata. This strata (or the nan value) is then written with the appropriate type to the strat raster band.

Parameters
size_tindex
RasterBandMetaData&dataBand
void*p_dataBuffer
RasterBandMetaData&stratBand
void*p_stratBuffer
std::vector<double>&quantiles
size_tmultiplier
bool&mapNan
size_t&mapStrat

◆ processPixel()

void sgs::quantiles::processPixel ( size_t index,
void * p_data,
helper::RasterBandMetaData * p_dataBand,
void * p_strat,
helper::RasterBandMetaData * p_stratBand,
std::vector< double > & quantiles )
inline

This is a helper function for processing a pixel of data.

First, the value is read in as a double, and it is determined whether the pixel is a nan pixel or not.

Then, if it isn't a nan pixel the lower bound of the value within the vector of break values is found. For example, if the value was 3 and the breaks vector was [2, 4, 6], the lower bound would be 1, which is the index of 4, the first value larger than 3 in the breaks vector. This lower bound is the strata. This strata (or the nan value) is then written with the appropriate type to the strat raster band.

Parameters
size_tindex
void*p_data
RasterBandMetaData*p_dataBand
void*p_strat
RasterBandMetaData*p_stratBand
std::vector<double>&quantiles

◆ quantiles()

std::pair< raster::GDALRasterWrapper *, std::unordered_map< std::string, std::vector< double > > > sgs::quantiles::quantiles ( raster::GDALRasterWrapper * p_raster,
std::map< int, std::vector< double > > userProbabilites,
bool map,
std::string filename,
std::string tempFolder,
bool largeRaster,
int threadCount,
std::map< std::string, std::string > driverOptions,
double eps )

This function stratifies a given raster using user-defined probabilities. The probabilities (quantiles) are provided as a vector of doubles mapped to a band index.

The function can be run on a single raster band or multiple raster bands, and the user may pass the map variable to combine the stratification of the given raster bands.

The function can be thought of in four different sections: the setup, the quantiles calculation, the processing, and the finish/return. During the setup, metadata is acquired for the input raster, and an output dataset is created which depends on user-given parameters and the input raster. During the quantiles calculation, Intel's Math Kernel Library (MKL) is used to calculate quantiles either with the entire raster in memory or in batches. During the processing step the raster is iterated through, either by blocks or with the entire raster in memory, the strata are determined for each pixel and then written to the output dataset. During the finish/return step, a GDALRasterWrapper object is created using the output dataset.

SETUP: the data structures holding metadata are initialized and it is determined whether the raster is a virtual raster or not, and if it is a virtual raster whether it is fully in-memory or whether it must be stored on disk.

If the user provides an output filename, the dataset will not be a virtual dataset instead it will be associated with the filename. If the user does not provide an output filename then a virtual dataset driver will be used. In the case of a large raster (whether or not the raster is large enough for this is calculated and passed by hte Python side of the application), the dataset will be VRT. If the package is comfortable fitting the entire raster in memory, an in-memory dataset will be used.

The input raster bands are iterated through, metadata is stored on them, and bands are created for the output dataset. In the case of a VRT dataset, each band is a complete dataset itself which must be added after it has been written to. In the case of a MEM dataset, the bands must be acquired from the input raster. Both MEM and VRT support the addBand function, and support bands of different types. Thus, the bands are added dynamically while iterating through input raster bands. Non virtual formats require the data types to be known at dataset initialization and don't support the addBand function, so the dataset must be created after iterating through the input bands.

QUANTILES CALCULATION: There are four different functions which do the quantiles calculation. Different functions are used depending on whether the raster should be batch processed, or whether it is entirely in memory. Further, there are different functions for single precision (float) or double precision (double) floating point values. Intels Math Kernel Library (MKL) is used to calculate the quantiles.

For single precision vs. double precision, the only case where double precision would be used is if the raster data type is double. the functions within the MKL library are slightly different for double and single precision. Further, the input and output vectors representing quantile probabilities and the quantiles themselves are vectors of double, so in the case where single precision is used, intermediate versions of the data must be written to and read from at the beginning and end of the function. The reasing why double precision is not used in every case, is because the same amount of values of single precision take up half the ammount of memory as their double precision counterparts. As a result, the memory usage due to intermediate stored values should be significantly less for single precision floating point values.

For batch processing vs non-batch processing, slightly different algorithms are used. The batch processing algorithm is a quantile streaming algorithm within MKL which creates a fast and accurate approximation of the quantiles without requiring the whole raster to be in memory at once. More information is contained in the documentation for those particular functions.

Information on the quantile streaming algorithm can be found here:

PROCESSING: the processing section iterated through ever pixel in every input band, and calculates/writes the strata to the corresponding output band.

There are four different cases dealing with whether or not the entire raster band is allocated in memory (the largeRaster bariable is false), and whether or not the values of each band should be mapped to an extra output raster band.

If the raster is large, it is porcessed in blocks and split into groups of blocks to be processed by multiple threads. If the raster bands are in-memory, the entire raster is processed at once by a single thread. The mapped rasters store information on an extra output raster band, the output values of which are determined as a function of all other output raster bands. The multipliers vector stores the information for this.

For the large rasters, the processing starts out by splitting the raster into chunks depending on the number of threads. A thread is then ctreated for each chunk. Within each thread, the blocks within it's designated chunk are iterated through and first read from the input bands, processed, then written to the output bands. In the case of a mapped raster all of the bands are iterated through alongside achother so that the intermediate mapping calculations don't have to be written then read again. In the case of a non mapped raster, each band is processed sequentially.

CLEANUP: If the output dataset is a VRT dataset, the dataset which represent its bands (which have not yet been added as bands) must be added as bands now that they are populated with data and are thus allowed to be added.

If the dataset output bands are fully in memory, they are moved to a vector from their metadata objects to be passed as a parameter to the GDALRasterWrapper constructor (or not if the bands aren't in memory). This GDALRasterWrapper is then returned.

Parameters
GDALRasterWrapper*p_raster
std::map<int,std::vector<double>>userProbabilities
boolmap
std::stringfilename
std::stringtempFolder,
boollargeRaster,
intthreadCount
std::map<std::string,std::string>driverOptions,
doubleeps
Returns
GDALRasterWrapper *pointer to newly created stratified raster