Datasets
BaseDSet
- class fireball.datasets.base.BaseDSet(dsName, dataPath, samples, labels, batchSize, numWorkers=0)
The base class for all other Dataset classes defined in this folder. This class can be used as the base class for both classification and regression problems.
Please refer to mnist.py for an example of how to derive from this class.
Constructs a BaseDSet instance. This constructor is never called directly; it is always called from the constructor of a derived class.
- Parameters:
dsName (str) – The name of the dataset. Common names include “Train”, “Test”, and “Valid”. Some derived classes may include additional application-specific names.
dataPath (str) – The path to the directory where the dataset files are located.
samples (list, numpy array, or None) –
If specified, it is used as the samples for the dataset. Depending on the application, it can be a list or a numpy array.
If samples is not specified, the loadSamples() method is called. This method MUST be implemented by any derived class.
labels (list, numpy array, or None) – If specified, it is used as the labels for the dataset. Depending on the application, it can be a list or a numpy array.
batchSize (int) – The default batch size used in the “batches” method.
numWorkers (int, optional) – If numWorkers is more than zero, “numWorkers” worker threads are created to process and prepare future batches in parallel.
- __init__(dsName, dataPath, samples, labels, batchSize, numWorkers=0)
Constructs a BaseDSet instance. This constructor is never called directly; it is always called from the constructor of a derived class. The parameters are the same as those of the class, described above.
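The following is a minimal sketch of a derived dataset class, loosely modeled on mnist.py. Only the constructor signature and the loadSamples()/isTraining behavior come from the documentation above; the .npy file names and the convention of storing the data in self.samples and self.labels are illustrative assumptions.

```python
# A minimal sketch of deriving from BaseDSet (file names are hypothetical).
import numpy as np
from fireball.datasets.base import BaseDSet

class MyDSet(BaseDSet):
    def __init__(self, dsName='Train', dataPath=None, samples=None,
                 labels=None, batchSize=32):
        # The base class calls loadSamples() when "samples" is None.
        super().__init__(dsName, dataPath, samples, labels, batchSize)

    def loadSamples(self):
        # MUST be implemented by derived classes. Here we assume the
        # samples and labels are stored in numpy files on disk.
        prefix = 'Train' if self.isTraining else 'Test'
        self.samples = np.load(self.dataPath + prefix + 'Samples.npy')
        self.labels = np.load(self.dataPath + prefix + 'Labels.npy')
```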
- property evalMetricBiggerIsBetter
Returns True if a larger value of this dataset’s evaluation metric is better. The default implementation in this base class returns True for the mAP, Accuracy, and PSNR metrics and False for all other metrics.
- property isTraining
Returns True if this dataset instance is a training dataset.
- getLabelAt(idx)
A method to return the label for the sample specified by idx. This method should be implemented by derived classes if they have a special way of accessing labels. For an example, see the implementation of this method in the ImageNetDSet class.
- Parameters:
idx (int) – The index specifying the sample in the dataset.
- Returns:
The label for the specified sample. For classification problems this is usually an integer specifying the class the sample belongs to. For regression problems this may be a single floating point value or a numpy array.
- Return type:
int, float, or numpy array
- loadSamples()
This method loads the dataset information from the dataset files. For the base class, this is just a placeholder.
Note
This method MUST be implemented by the derived classes.
- classmethod makeDatasets(dsNames='Train,Test,Valid', batchSize=32, dataPath=None, printInfo=False)
This class method creates Train, Test, and/or Validation datasets. For the base class, this is just a placeholder.
Note
This method MUST be implemented by the derived classes.
- classmethod postMakeDatasets()
This class method is called at the end of a call to makeDatasets. It can be used by derived classes to set up the dataset class after the datasets are created.
- split(dsName='Valid', ratio=0.1, batchSize=None, repeatable=True)
This method must be implemented by the derived classes.
- getSplitIndexes(ratio=0.1, repeatable=True)
This function returns a list of sample indexes that can be used to split this dataset. This is a utility function usually used by the split function implemented by the derived classes. As an example, refer to the implementation of the split function in the MnistDSet class.
- Parameters:
ratio (float, optional) – The ratio of the number of split indexes to the total number of samples in this dataset. The default is 10 percent of the samples.
repeatable (Boolean, optional) – If True, the sampling from the original dataset is deterministic and therefore the experiments are repeatable. Otherwise, the sampling is done randomly.
- Returns:
A list of indexes of the samples included in the split.
- Return type:
list
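As a rough illustration, a derived class’s split implementation might use this utility as sketched below. The attribute names (self.samples, self.labels, self.dataPath, self.batchSize) are assumptions for this sketch; see the MnistDSet class for the actual implementation.

```python
import numpy as np

# A sketch of a derived-class split() built on getSplitIndexes().
def split(self, dsName='Valid', ratio=0.1, batchSize=None, repeatable=True):
    splitIdx = self.getSplitIndexes(ratio, repeatable)  # samples moving out
    keepIdx = np.setdiff1d(np.arange(len(self.samples)), splitIdx)
    # Build the new dataset from the split-off samples and labels.
    newDs = self.__class__(dsName, self.dataPath,
                           self.samples[splitIdx], self.labels[splitIdx],
                           batchSize or self.batchSize)
    # Keep only the remaining samples in this instance.
    self.samples, self.labels = self.samples[keepIdx], self.labels[keepIdx]
    return newDs
```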
- mergeWith(otherDSet)
This function merges the contents of this dataset with the contents of another dataset specified by otherDSet. The datasets must have similar properties. The dataset otherDSet remains unchanged.
- Parameters:
otherDSet (any object derived from BaseDSet) – The dataset whose contents are merged into this one (“self”).
- classmethod printDsInfo(trainDs=None, testDs=None, validDs=None)
This class method prints information about given set of datasets in a single table.
- Parameters:
trainDs (any object derived from BaseDSet, optional) – The training dataset.
testDs (any object derived from BaseDSet, optional) – The test dataset.
validDs (any object derived from BaseDSet, optional) – The validation dataset.
- classmethod printStats(trainDs=None, testDs=None, validDs=None)
This class method prints statistics of classes for the given set of datasets in a single table. This is used only for the classification datasets.
- Parameters:
trainDs (any object derived from BaseDSet, optional) – The training dataset.
testDs (any object derived from BaseDSet, optional) – The test dataset.
validDs (any object derived from BaseDSet, optional) – The validation dataset.
- boostClass(classIndex, ratio)
This method increases the number of samples for the specified class by the specified ratio. For example, if the ratio is 2, the dataset will have twice the original number of samples for the specified class.
This can be used to add bias for a specific class for training. It is recommended to call this function just before calling the train function of the model.
- Parameters:
classIndex (int) – The index of the class. Must be in “range(self.numClasses)”.
ratio (float) – The ratio of the new number of samples for the specified class to the original number of samples for that class.
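For example, to double the number of samples of class 3 just before training (trainDs is assumed to be a classification dataset derived from BaseDSet):

```python
# Class 3 will have twice its original number of samples.
trainDs.boostClass(3, 2.0)
```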
- getBatch(batchIndexes)
This method returns a batch of samples and labels from the dataset as specified by the list of indexes in the batchIndexes parameter. This is a generic implementation that works for most cases. A derived class may implement a customized version of this method.
- Parameters:
batchIndexes (list of int) – A list of indexes used to pick samples and labels from the dataset.
- Returns:
samples (list or numpy array) – The batch samples specified by the batchIndexes.
labels (list or numpy array) – The batch labels specified by the batchIndexes.
- batches(iterBatchSize=None, numWorkers=None, sampleIndexes=None)
This generator function is used to loop through all the samples. This is used by the train method of a Model object during the training. This can also be used when evaluating a model using validation or test datasets.
If the current dataset is a training or fine-tuning dataset, the samples are shuffled before the start of the loop.
- Parameters:
iterBatchSize (int, optional) – If not specified, the default batch size specified in __init__ is used. Otherwise, this value overrides the predefined batchSize.
numWorkers (int or None, optional) – If not specified (=None, default) the dataset’s numWorkers is used. If numWorkers is more than zero, “numWorkers” worker threads are created to process and prepare future batches in parallel.
sampleIndexes (list of int, optional) – If specified, the batches are taken only from the samples specified by “sampleIndexes”. Otherwise (default), all available samples are considered for the batches.
- Yields:
samples (numpy array) – The next batch of samples.
labels (list of 2-tuples or a 3-tuple of numpy arrays) – The next batch of labels.
Note
This function first obtains a list of indexes for the next batch of the dataset. If the number of workers is 0, it calls the getBatch function to return the batch samples and labels.
If the number of workers is not zero, the batch indexes are put in the jobs queue, where they are picked up by one of the worker threads.
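A typical loop over a dataset might look like the following sketch, where ds is assumed to be an instance of a class derived from BaseDSet:

```python
# Loop through all samples, one batch at a time, using 4 worker threads.
for samples, labels in ds.batches(iterBatchSize=128, numWorkers=4):
    # Feed the batch to a model here; this part is application specific.
    print(len(samples))
```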
- evaluateModel(model, batchSize=None, quiet=False, returnMetric=False, **kwargs)
This function evaluates the specified model using this dataset.
- Parameters:
model (Fireball Model object) – The model being evaluated.
batchSize (int) – The batchSize used for the evaluation process. This function processes one batch of the samples at a time. If this is None, the batch size specified for this dataset object in the __init__ function is used instead.
quiet (Boolean) – If True, no messages are printed to “stdout” during the evaluation process.
returnMetric (Boolean) –
If true, instead of calculating all the results, just calculates the main metric of the dataset and returns that. This is mostly used during the training at the end of each epoch.
Otherwise, if this is False (the default), the full results are calculated and a dictionary of all results is returned.
**kwargs (dict) –
This contains some additional task specific arguments. Here is a list of what can be included in this dictionary.
maxSamples (int): The maximum number of samples from this dataset to be processed for the evaluation of the model. If not specified, all samples are used (default behavior).
sampleIndexes (list of ints): A list of sample indexes from this dataset to be processed for the evaluation of the model. If not specified, all samples are used (default behavior).
topK (int): For classification cases, this indicates whether a “top-K” accuracy value should also be calculated. For example, for ImageNet classification, the top-5 accuracy value (topK=5) is usually used besides the top-1. If it is zero (default), the top-K accuracy is not calculated. This is ignored for regression cases.
confMat (Boolean): For classification cases, this indicates whether the confusion matrix should be calculated. If the number of classes is more than 10, this argument is ignored and the confusion matrix is not calculated. This is ignored for regression cases.
expAcc (Boolean or None): Ignored for regression cases. For classification cases:
If this is True, the expected accuracy and kappa values are also calculated. When the number of classes and/or the number of evaluation samples is large, calculating the expected accuracy can take a long time.
If this is False, the expected accuracy and kappa are not calculated.
If this is None (the default), the expected accuracy and kappa are calculated only if the number of classes does not exceed 10.
Note: If confMat is True, then expAcc is automatically set to True.
jsonFile (str): The name of JSON file that is created by this function. This is used with some NLP applications where the results could be saved to a JSON file for evaluation.
- Returns:
If returnMetric is True, the actual value of dataset’s main metric is returned.
Otherwise, this function returns a dictionary containing the results of the evaluation process.
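A usage sketch, assuming model is a trained Fireball Model object and testDs is a classification dataset:

```python
# Full evaluation with top-5 accuracy and a confusion matrix.
results = testDs.evaluateModel(model, batchSize=128, topK=5, confMat=True)

# Fast check of the dataset's main metric only (as done at the end of
# each training epoch).
metric = testDs.evaluateModel(model, returnMetric=True, quiet=True)
```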
- evalMultiDimRegression(model, batchSize=None, quiet=False, returnMetric=False, **kwargs)
This function evaluates the specified model using this dataset. It is currently only used for multi-dimensional regression problems, when the output layer supports evaluation in the graph (supportsEval is True).
The parameters and return values of this function are the same as those of the “evaluateModel” function defined above.
- getMetricVal(predicted, actual)
This function calculates and returns the evaluation metric specified by the “evalMetricName” property of this dataset class. It is called by the evaluateModel() function when the “returnMetric” argument is set to True.
- Parameters:
predicted (array) – The predicted values of the output for the evaluation samples. This is a 1-D array of labels for classification problems or an array of output tensors for regression problems.
actual (array) – The actual values of the output for the evaluation samples.
- Returns:
The calculated value of the metric for this dataset.
- Return type:
float
- evaluate(predicted, actual, topK=0, confMat=False, expAcc=None, quiet=False)
Returns information about evaluation results based on the “predicted” and “actual” values. It calls the “evaluateClassification” function for classification problems and “evaluateRegression” function for regression problems.
- Parameters:
predicted (array) – The predicted values of the output for the evaluation samples. For classification cases, if topK is not zero, this is a list of arrays, each containing the K class indexes with the highest probabilities in ascending order (the last one is the best). If topK is zero, this is a 1-D array of predicted labels. For regression cases, this is a list of predicted tensors.
actual (array) – The actual values of the output for the evaluation samples.
topK (int) – For classification cases, this indicates whether a “top-K” accuracy value should also be calculated. For example, for ImageNet classification, the top-5 accuracy value (topK=5) is usually used besides the top-1. If it is zero (default), the top-K accuracy is not calculated. This is ignored for regression cases.
confMat (Boolean) – For classification cases, this indicates whether the confusion matrix should be calculated. If the number of classes is more than 10, this argument is ignored and the confusion matrix is not calculated. This is ignored for regression cases.
expAcc (Boolean or None) –
Ignored for regression cases. For classification cases:
If this is True, the expected accuracy and kappa values are also calculated. When the number of classes and/or the number of evaluation samples is large, calculating the expected accuracy can take a long time.
If this is False, the expected accuracy and kappa are not calculated.
If this is None (the default), the expected accuracy and kappa are calculated only if the number of classes does not exceed 10.
Note: If confMat is True, then expAcc is automatically set to True.
quiet (Boolean) – If False, it prints the test results. The printed information includes the Confusion matrix and accuracy information for Classification problems and MSE, RMS, MAE, and PSNR for Regression problems.
- Returns:
A dictionary of the results information. See “evaluateClassification” and “evaluateRegression” functions for more information.
- Return type:
dict
- evaluateRegression(predicted, actual, quiet=False)
Returns information about test results for Regression problems based on the “predicted” and “actual” values.
This function is used only when the regression output of the model is a scalar value. For multi-dimensional regression problems the “evalMultiDimRegression” function is used.
Note
This function is not usually called directly. You should use the “evaluate” function which calls this function internally if this is a regression dataset.
- Parameters:
predicted (array) – The predicted values of the output as an array of output tensors.
actual (array) – The actual values of the output for the test samples.
quiet (Boolean) – If False, it prints the test results. The printed information includes MSE, RMS, MAE, and PSNR.
- Returns:
A dictionary of the results information. Here is a list of items in the results dictionary:
mse: Mean Square Error (MSE)
rmse: Root Mean Square Error (RMSE)
mae: Mean Absolute Error (MAE)
psnr: Peak Signal to Noise Ratio (PSNR)
csvItems: A list of evaluation metrics that will be included in the CSV file when performing a parameter search.
- Return type:
dict
- evaluateClassification(predicted, actual, topK=0, confMat=False, expAcc=None, quiet=False)
Returns classification metrics and draws a confusion matrix for the results based on the information in the actual and predicted labels.
- Parameters:
predicted (array) – The predicted values of the output for the evaluation samples. If topK is not zero, this is a list of arrays, each containing the K class indexes with the highest probabilities in ascending order (the last one is the best). If topK is zero, this is a 1-D array of predicted labels.
actual (array) – The actual values of the output for the test samples.
topK (int) – This indicates whether a “top-K” accuracy value should also be calculated. For example, for ImageNet classification, the top-5 accuracy value (topK=5) is usually used besides the top-1. If it is zero (default), the top-K accuracy is not calculated.
confMat (Boolean) – This indicates whether the confusion matrix should be calculated. If the number of classes is more than 10, this argument is ignored and the confusion matrix is not calculated.
expAcc (Boolean or None) –
If this is True, the expected accuracy and kappa values are also calculated. When the number of classes and/or the number of evaluation samples is large, calculating the expected accuracy can take a long time.
If this is False, the expected accuracy and kappa are not calculated.
If this is None (the default), the expected accuracy and kappa are calculated only if the number of classes does not exceed 10.
Note: If confMat is True, then expAcc is automatically set to True.
quiet (Boolean) – If False, it prints the test results. The printed information includes the accuracy information and the confusion matrix if the number of classes does not exceed 10.
- Returns:
A dictionary of the results information. Here is a list of items in the results dictionary:
accuracy: The evaluated accuracy, a float number between 0 and 1. Also known as “Observed Accuracy”, it is defined as:
\[\text{Observed Accuracy} = \frac{TN+TP}{N}\]
errorRate: The evaluated error rate (= 1 - accuracy). A float number between 0 and 1.
expectedAccuracy: The Expected Accuracy of the evaluation. It is defined as:
\[\text{Expected Accuracy} = \frac{(TN+FN)(TN+FP)+(FP+TP)(FN+TP)}{N^2}\]
kappa: The kappa statistic. It is defined as:
\[\kappa = \frac{\text{Observed Accuracy} - \text{Expected Accuracy}}{1-\text{Expected Accuracy}}\]
truePositives: An array containing the True Positive (TP) count for each class.
precisions: An array containing the Precision value for each class. The Precision for each class is defined as:
\[\text{Precision} = \frac{TP}{TP+FP}\]
recalls: An array containing the Recall value for each class. The Recall for each class is defined as:
\[\text{Recall} = \frac{TP}{TP+FN}\]
fMeasures: An array containing the F-measure value for each class. The F-measure for each class is defined as:
\[f = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}\]
confusionAP: A 2-D n×n array, where n is the number of classes. confusionAP[a][p] is the number of samples that were predicted as class ‘p’ and are actually in class ‘a’.
top<N>Accuracy: Included only if topK is not zero. The <N> in the key name is replaced with the actual value of topK.
csvItems: A list of evaluation metrics that will be included in the CSV file when performing a parameter search.
In the above equations:
TN: True Negatives
TP: True Positives
FN: False Negatives
FP: False Positives
- Return type:
dict
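A small illustrative call, assuming ds is a classification dataset object; the label arrays are toy values:

```python
predicted = [0, 1, 1, 2]    # predicted class labels (topK=0 case)
actual    = [0, 1, 2, 2]    # actual class labels
results = ds.evaluate(predicted, actual, quiet=True)
print(results['accuracy'])  # -> 0.75 (3 of the 4 predictions are correct)
```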
- classmethod download(folderName, files, destDataFolder=None)
This class method can be called to download dataset files from their original source or a Fireball online repository. This method is usually called internally by one of the derived classes.
- Parameters:
folderName (str) – A string containing the name of the dataset folder. This is used both for the Fireball repository and for the destination folder name on the local machine.
files (list of str) –
A list of strings. Each item in the list can be a file name or a URL.
URL: In this case the URL is tried first (this is usually the original location of the dataset files). If for any reason the file cannot be downloaded, the file name is extracted from the URL and the file is downloaded from the Fireball repository instead.
Name: In this case the file is downloaded directly from the Fireball repository.
If the downloaded file is a zip file, it is extracted to the dataset directory.
destDataFolder (str) – The folder where dataset folders and files are saved. If this is not provided, then a folder named “data” is created in the home directory of the current user and the dataset folders and files are created there.
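A sketch of a typical call from a derived class; the folder and file names below are hypothetical:

```python
from fireball.datasets.base import BaseDSet

# The URL is tried first, falling back to the Fireball repository;
# "Data.zip" is fetched from the Fireball repository directly.
BaseDSet.download('MyDataset',
                  ['https://example.com/files/Train.zip',
                   'Data.zip'])
```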
MNIST
This module contains the implementation of the MNIST dataset for handwritten digit classification. Use the MnistDSetUnitTest.py file in the UnitTest/Datasets directory to run the unit test of this implementation.
- Dataset Stats

| Dataset  | Total Samples | Samples Per Class |
|----------|---------------|-------------------|
| Training | 60,000        | 5,421 to 6,742    |
| Test     | 10,000        | 892 to 1,135      |
- class fireball.datasets.mnist.MnistDSet(dsName='Train', dataPath=None, samples=None, labels=None, batchSize=64)
This class implements the MNIST dataset.
Constructs an MnistDSet instance. This can be called directly or via the makeDatasets class method.
- Parameters:
dsName (str) – The name of the dataset. It can be one of “Train”, “Test”, or “Tune”. Note that “Valid” cannot be used here.
dataPath (str) –
The path to the directory where the dataset files are located. This implementation expects the following files in the “dataPath” directory:
t10k-images.idx3-ubyte
t10k-labels.idx1-ubyte
train-images.idx3-ubyte
train-labels.idx1-ubyte
samples (numpy array or None) – If specified, it is used as the samples for the dataset. It is a numpy array of samples. Each sample is an image represented as a numpy array of shape (28,28,1).
labels (numpy array or None) – If specified, it is a numpy array of int32 values. Each label is an int32 number between 0 and 9 indicating the class for each sample.
batchSize (int) – The default batch size used in the “batches” method.
- __init__(dsName='Train', dataPath=None, samples=None, labels=None, batchSize=64)
Constructs an MnistDSet instance. This can be called directly or via the makeDatasets class method. The parameters are the same as those of the class, described above.
- classmethod download(dataFolder=None)
This class method can be called to download the MNIST dataset files.
- Parameters:
dataFolder (str) – The folder where dataset files are saved. If this is not provided, then a folder named “data” is created in the home directory of the current user and the dataset folders and files are created there. In other words, the default data folder is ~/data.
- split(dsName='Valid', ratio=0.1, batchSize=None, repeatable=True)
This function splits the current dataset and returns a portion of data as a new MnistDSet object. The current object is then updated to keep the remaining samples.
This method keeps the same ratio between the number of samples for each class. This means that if the original dataset was not balanced, the split datasets are also not balanced and have the same ratio of samples per class.
- Parameters:
dsName (str) – The name of the new dataset that is created.
ratio (float) – The ratio between the number of samples removed from this dataset and the total number of samples in this dataset before the split. The default value of 0.1 results in a new dataset with 10% of the samples; the remaining 90% of the samples stay in the current instance.
batchSize (int or None) – The batchSize used for the new MnistDSet object created. If not specified the new MnistDSet instance inherits the batchSize from this object.
repeatable (Boolean) – If True, the sampling from the original dataset is deterministic and therefore the experiments are repeatable. Otherwise, the sampling is done randomly.
- Returns:
A new dataset containing a portion (specified by ratio) of samples from this object.
- Return type:
MnistDSet
- classmethod createFineTuneDataset(dataPath=None, ratio=0.1)
This class method creates a fine-tuning dataset and saves the information permanently on the file system so that it can be loaded later.
- Parameters:
dataPath (str) – The path to the directory where the dataset files are located.
ratio (float) – The ratio between the number of samples in the fine-tuning dataset and the total number of training samples. The default value of 0.1 results in a new dataset with 10% of the training samples.
Note
This method can be used when consistent results are required. The same dataset samples are used every time the fine-tuning algorithm uses the dataset created by this method.
- classmethod makeDatasets(dsNames='Train,Test,Valid', batchSize=32, dataPath=None)
This class method creates several datasets, as specified by the dsNames parameter, in one shot.
- Parameters:
dsNames (str) –
A combination of the following:
”Train”: Create the training dataset.
”Test”: Create the test dataset.
”Valid”: Create the validation dataset. The ratio of validation samples can be specified using a % sign followed by the percentage. For example, “Valid%10” means create a validation dataset with 10% of the training data.
If “Valid” is included in dsNames, there must be at least a “Train” or “Tune” dataset.
If “Tune” is included instead of “Train”, then the validation samples are taken from the fine-tuning samples. (See the examples below.)
”Tune”: Create the fine-tuning dataset. The ratio of fine-tuning samples can be specified using a % sign followed by the percentage. For example, “Tune%5” means create a fine-tuning dataset with 5% of the training data.
If a percentage is not specified and a Tuning dataset file (created by createFineTuneDataset function) is available, the fine-tuning samples are loaded from the existing file.
batchSize (int) – The batchSize used for all the datasets created.
dataPath (str) – The path to the directory where the dataset files are located.
- Returns:
Depending on the number of items specified in the dsNames, it returns between one and three MnistDSet objects. The returned values have the same order as they appear in the dsNames parameter.
- Return type:
Up to 3 MnistDSet objects
Note
To specify the training dataset, any string containing the word “train” (case insensitive) is accepted. So, “Training”, “TRAIN”, and ‘train’ all can be used.
To specify the test dataset, any string containing the word “test” (case insensitive) is accepted. So, “testing”, “TEST”, and ‘test’ all can be used.
To specify the validation dataset, any string containing the word “valid” (case insensitive) is accepted. So, “Validation”, “VALID”, and ‘valid’ all can be used.
To specify the fine-tuning dataset, any string containing the word “tun” (case insensitive) is accepted. So, “Fine-Tuning”, “Tuning”, and ‘tune’ all can be used.
When the ‘%’ is used to specify the ratio for ‘Validation’ and ‘Tuning’ datasets, the subsampling is deterministic and the results are repeatable across different executions and even different platforms. If you want the results to be random, you can use ‘%r’ instead of ‘%’. For example, “Tune%r10” creates a dataset with 10% of the training data, selected randomly. A different call on the same or a different machine will probably choose a different set of samples.
Examples
dsNames="Train,Test,Valid%5": 3 MnistDSet objects are returned for training, test, and validation, in that order. The validation dataset contains 5% of the available training data and the training dataset contains the remaining 95%.
dsNames="Train,Test": 2 MnistDSet objects are returned for training and test. The training dataset contains all available training data.
dsNames="FineTuning%r5,Test": 2 MnistDSet objects are returned for fine-tuning and test. The fine-tuning dataset contains 5% of the training data (picked randomly because of ‘%r’).
dsNames="Tune%5,Test,Validation%5": 3 MnistDSet objects are returned for fine-tuning, test, and validation, in that order. The fine-tuning and validation datasets together contain 5% of the available training data. The validation dataset contains 5% of that (5% of 5%, or 0.0025 of the training data) and the fine-tuning dataset contains the remaining 95% (95% of 5%, or 0.0475 of the training data).
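A usage sketch based on the examples above:

```python
from fireball.datasets.mnist import MnistDSet

# Create training, test, and validation datasets in one shot; the
# validation dataset takes 5% of the training samples.
trainDs, testDs, validDs = MnistDSet.makeDatasets('Train,Test,Valid%5',
                                                  batchSize=64)
MnistDSet.printDsInfo(trainDs, testDs, validDs)
```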
CIFAR
This module contains the implementation of the “CifarDSet” dataset class for image classification (CIFAR-100). Use the CifarDSetUnitTest.py file in the UnitTest/Datasets folder to run the unit test of this implementation.
This implementation assumes that the following files exist in the ‘dataPath’ directory:
meta: Class Names.
train: Training Samples and Labels.
test: Test Samples and Labels.
- Dataset Stats
The 100 classes in the CIFAR-100 dataset are grouped into 20 superclasses. In the following table and in the API, “Fine” is used for the case where the 100-class version is used and “Coarse” is used where the 20 superclasses are used.

| Dataset  | Total Samples | Samples Per Class          |
|----------|---------------|----------------------------|
| Training | 50,000        | 500 (Fine), 2,500 (Coarse) |
| Test     | 10,000        | 100 (Fine), 500 (Coarse)   |
- class fireball.datasets.cifar.CifarDSet(dsName='Train', dataPath=None, samples=None, labels=None, batchSize=64, coarseLabels=False)
This class implements the CIFAR-100 dataset.
Constructs a CifarDSet instance. This can be called directly or via the makeDatasets class method.
- Parameters:
dsName (str) – The name of the dataset. It can be one of “Train”, “Test”, or “Tune”. Note that “Valid” cannot be used here.
dataPath (str) – The path to the directory where the dataset files are located.
samples (numpy array or None) – If specified, it is used as the samples for the dataset. It is a numpy array of samples. Each sample is an image represented as a numpy array of shape (32,32,3).
labels (numpy array or None) – If specified, it is a numpy array of int32 values. Each label is an int32 number between 0 and 99 (0 and 19 for the coarse dataset) indicating the class for each sample.
batchSize (int) – The default batch size used in the “batches” method.
coarseLabels (Boolean) – If True, the coarse dataset is loaded which has only 20 classes of images.
- numClasses = None
- __init__(dsName='Train', dataPath=None, samples=None, labels=None, batchSize=64, coarseLabels=False)
Constructs a CifarDSet instance. This can be called directly or via the makeDatasets class method. The parameters are the same as those of the class, described above.
- classmethod download(dataFolder=None)
This class method can be called to download the CIFAR-100 dataset files from a Fireball online repository.
- Parameters:
dataFolder (str) – The folder where dataset files are saved. If this is not provided, then a folder named “data” is created in the home directory of the current user and the dataset folders and files are created there. In other words, the default data folder is ~/data.
- split(dsName='Valid', ratio=0.1, batchSize=None, repeatable=True)
This function splits the current dataset and returns a portion of data as a new CifarDSet object. The current object is then updated to keep the remaining samples.
This method keeps the same ratio between the number of samples for each class. This means that if the original dataset was not balanced, the split datasets are also not balanced and have the same ratio of samples per class.
- Parameters:
dsName (str) – The name of the new dataset that is created.
ratio (float) – The ratio between the number of samples removed from this dataset and the total number of samples in this dataset before the split. The default value of 0.1 results in a new dataset with 10% of the samples; the remaining 90% of the samples stay in the current instance.
batchSize (int or None) – The batchSize used for the new CifarDSet object created. If not specified the new CifarDSet instance inherits the batchSize from this object.
repeatable (Boolean) – If True, the sampling from the original dataset is deterministic and therefore the experiments are repeatable. Otherwise, the sampling is done randomly.
- Returns:
A new dataset containing a portion (specified by ratio) of samples from this object.
- Return type:
CifarDSet
- classmethod createFineTuneDataset(dataPath=None, ratio=0.1, coarseLabels=False)
This class method creates a fine-tuning dataset and saves the information permanently on the file system so that it can be loaded later.
- Parameters:
dataPath (str) – The path to the directory where the dataset files are located.
ratio (float) – The ratio between the number of samples in the fine-tuning dataset and the total number of training samples. The default value of 0.1 results in a new dataset with 10% of the training samples.
coarseLabels (Boolean) – If True, the coarse dataset (with 20 classes) is loaded and used for creation of the Fine-Tuning dataset.
Note
This method can be used when consistent results are required. The same dataset samples are used every time the fine-tuning algorithm uses the dataset created by this method.
- classmethod makeDatasets(dsNames='Train,Test,Valid', batchSize=64, dataPath=None, coarseLabels=False)
This class method creates several datasets, as specified by the dsNames parameter, in one shot.
- Parameters:
dsNames (str) –
A combination of the following:
”Train”: Create the training dataset.
”Test”: Create the test dataset.
”Valid”: Create the validation dataset. The ratio of validation samples can be specified using a % sign followed by the percentage. For example, “Valid%10” means create a validation dataset with 10% of the training data.
If “Valid” is included in dsNames, it must be after the “Train” or “Tune”.
If “Tune” is included instead of “Train”, then the validation samples are taken from the fine-tuning samples. (See the examples below.)
”Tune”: Create the fine-tuning dataset. The ratio of fine-tuning samples can be specified using a % sign followed by the percentage. For example, “Tune%5” means create a fine-tuning dataset with 5% of the training data.
If a percentage is not specified and a Tuning dataset file (created by createFineTuneDataset function) is available, the fine-tuning samples are loaded from the existing file.
batchSize (int) – The batchSize used for all the datasets created.
dataPath (str) – The path to the directory where the dataset files are located.
coarseLabels (Boolean) – If True, the coarse datasets are loaded which have only 20 classes of images.
- Returns:
Depending on the number of items specified in the dsNames, it returns between one and three CifarDSet objects. The returned values have the same order as they appear in the dsNames parameter.
- Return type:
Up to 3 CifarDSet objects
Note
To specify the training dataset, any string containing the word “train” (case insensitive) is accepted. So, “Training”, “TRAIN”, and ‘train’ all can be used.
To specify the test dataset, any string containing the word “test” (case insensitive) is accepted. So, “testing”, “TEST”, and ‘test’ all can be used.
To specify the validation dataset, any string containing the word “valid” (case insensitive) is accepted. So, “Validation”, “VALID”, and ‘valid’ all can be used.
To specify the fine-tuning dataset, any string containing the word “tun” (case insensitive) is accepted. So, “Fine-Tuning”, “Tuning”, and ‘tune’ all can be used.
When the ‘%’ is used to specify the ratio for ‘Validation’ and ‘Tuning’ datasets, the subsampling is deterministic and the results are repeatable across different executions and even different platforms. If you want the results to be random, you can use ‘%r’ instead of ‘%’. For example, “Tune%r10” creates a dataset with 10% of the training data, selected randomly. A different call on the same or a different machine will probably choose a different set of samples.
Examples
dsNames="Train,Test,Valid%5": 3 CifarDSet objects are returned for training, test, and validation, in that order. The validation dataset contains 5% of the available training data and the training dataset contains the remaining 95%.
dsNames="Train,Test": 2 CifarDSet objects are returned for training and test. The training dataset contains all available training data.
dsNames="FineTuning%r5,Test": 2 CifarDSet objects are returned for fine-tuning and test. The fine-tuning dataset contains 5% of the training data (picked randomly because of ‘%r’).
dsNames="Tune%5,Test,Validation%5": 3 CifarDSet objects are returned for fine-tuning, test, and validation, in that order. The fine-tuning and validation datasets together contain 5% of the available training data. The validation dataset contains 5% of that (5% of 5%, or 0.0025 of the training data) and the fine-tuning dataset contains the remaining 95% (95% of 5%, or 0.0475 of the training data).
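A usage sketch based on the examples above, here loading the 20-superclass (coarse) version:

```python
from fireball.datasets.cifar import CifarDSet

# Training and test datasets with the 20 coarse superclasses.
trainDs, testDs = CifarDSet.makeDatasets('Train,Test', batchSize=64,
                                         coarseLabels=True)
```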
ImageNet
This module contains the implementation of the ImageNet dataset for image recognition. This implementation assumes that the following files exist in the ‘dataPath’ directory:
ILSVRC2012Train224: Contains 1000 folders (one per class, named by the class id). The training image files in each folder are named like “Image_nnnnnn.JPEG”, with nnnnnn starting at 000001.
ILSVRC2012Val224: Contains 1000 folders (one per class, named by the class id). The validation image files in each folder are named like “Image_nnnnnn.JPEG”, with nnnnnn starting at 000001.
TrainDataset.csv: Information about each class and the number of training samples for each class.
ValDataset.csv: Information about each class and the number of validation samples for each class.
- Pre-Processing
Crop256Cafe: Resize the smaller dimension to 256, crop the center 224x224, BGR output, normalized by subtracting the mean [103.939, 116.779, 123.68].
ForceCafe: Force resize to 224x224 (may lose the aspect ratio), BGR output, normalized by subtracting the mean [103.939, 116.779, 123.68].
Crop256PyTorch: Resize the smaller dimension to 256, crop the center 224x224, RGB output, normalized to 0..1, then normalized using mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225].
ForcePyTorch: Force resize to 224x224 (may lose the aspect ratio), RGB output, normalized to 0..1, then normalized using mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225].
Crop256Tf: Resize the smaller dimension to 256, crop the center 224x224, RGB output, normalized to -1..1.
ForceTf: Force resize to 224x224 (may lose the aspect ratio), RGB output, normalized to -1..1.
To test this implementation, run the ImageNetDSetUnitTest.py file in the UnitTest/Datasets directory.
For more info see the Keras image utilities and preprocessing code.
- Dataset Stats

| Dataset  | Total Samples | Samples Per Class |
|----------|---------------|-------------------|
| Training | 1,281,167     | 732 to 1,300      |
| Test     | 50,000        | 50                |
- class fireball.datasets.imagenet.ImageNetDSet(dsName='Train', dataPath=None, samples=None, labels=None, batchSize=256, preProcessing='Crop256Cafe', numWorkers=8)
This class implements the ImageNet dataset.
Constructs an ImageNetDSet instance. This can be called directly or via the makeDatasets class method.
- Parameters:
dsName (str) – The name of the dataset. It can be one of “Train”, “Test”, or “Tune”. Note that “Valid” cannot be used here.
dataPath (str) – The path to the directory where the dataset files are located.
samples (list or None) – If specified, it is used as the samples for the dataset. It is a list of tuples containing information about the samples in the dataset. If samples is not specified, the loadSamples method is called by the base class.
labels (None) – The labels for each sample in the dataset.
batchSize (int) – The default batch size used in the “batches” method.
preProcessing (str) – The type of preprocessing used when loading the images. Please refer to the “Pre-Processing” section above in this module’s documentation for an explanation for each one of the pre-processing methods.
numWorkers (int) – The number of worker threads used to load the images.
- __init__(dsName='Train', dataPath=None, samples=None, labels=None, batchSize=256, preProcessing='Crop256Cafe', numWorkers=8)
Constructs an ImageNetDSet instance. This can be called directly or via the makeDatasets class method. The parameters are the same as those of the class, described above.
- classmethod download(dataFolder=None)
This class method can be called to download the ImageNet dataset files from a Fireball online repository. Please note that this does not include the training dataset. Only Tuning and Test datasets are downloaded. All image files have already been resized to 224x224.
- Parameters:
dataFolder (str) – The folder where dataset files are saved. If this is not provided, then a folder named “data” is created in the home directory of the current user and the dataset folders and files are created there. In other words, the default data folder is ~/data.
- getImage(fileName)
Returns a numpy array containing the image information for the image file specified by the fileName.
- Parameters:
fileName (str) – The name of the image file.
- Returns:
The image information as a numpy array. If the preProcessing is one of ‘Crop256Cafe’ or ‘ForceCafe’, the returned value is in BGR format; otherwise it is in RGB format.
- Return type:
numpy array
- preProcessImages(images)
Preprocesses the specified images based on the method specified by preProcessing. Please refer to the “Pre-Processing” section above in this module’s documentation for an explanation of each of the pre-processing methods.
- Parameters:
images (numpy array) – The image(s) to be pre-processed as a numpy array.
- Returns:
The processed image(s) as a numpy array.
- Return type:
numpy array
- resizedImg(img)
Resizes the specified image using the method specified by preProcessing. The resized image is always a 224x224x3 numpy array of type float32.
- Parameters:
img (numpy array) – The image to be resized.
- Returns:
The resized image as a numpy array of shape (224,224,3)
- Return type:
numpy array
- getPreprocessedImage(imageFileName)
A utility function that loads an image, resizes and preprocesses it, and returns it as a numpy array of type float32.
- Parameters:
imageFileName (str) – The path to the image file.
- Returns:
The resized and preprocessed image as a numpy array of shape (224,224,3)
- Return type:
numpy array
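A sketch, assuming ds is an ImageNetDSet instance; the image path is a placeholder:

```python
# Load, resize, and preprocess a single image (e.g. for inference).
img = ds.getPreprocessedImage('/path/to/image.JPEG')
print(img.shape, img.dtype)   # -> (224, 224, 3) float32
```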
- split(dsName='Valid', ratio=0.1, batchSize=None)
This function splits the current dataset and returns a portion of data as a new ImageNetDSet object. This object is then updated to keep the remaining samples.
This method keeps the same ratio between the number of samples for each class. This means that if the original dataset was not balanced, the split datasets are also not balanced and have the same ratio of samples per class.
- Parameters:
dsName (str) – The name of the new dataset that is created.
ratio (float) – The ratio between the number of samples removed from this dataset and the total number of samples in this dataset before the split. The default value of 0.1 results in a new dataset with 10% of the samples; the remaining 90% of the samples stay in this object.
batchSize (int or None) – The batchSize used for the new ImageNetDSet object created. If not specified the new object inherits the batchSize from this object.
- Returns:
A new dataset containing a portion (specified by ratio) of samples from this object.
- Return type:
ImageNetDSet
Note
This function assumes that “self.samples” is organized as follows: all samples in class 0, then all samples in class 1, …, then all samples in class 999. This is the case if the loadSamples method is used to load the samples.
The sampling from the original dataset is deterministic and therefore the experiments are repeatable.
- classmethod createFineTuneDataset(dataPath=None, ratio=0.1, copyImages=True)
This class method creates a fine-tuning dataset and saves the information permanently on the file system so that it can be loaded later.
- Parameters:
dataPath (str) – The path to the directory where the dataset files are located.
ratio (float) – The ratio between the number of samples in the fine-tuning dataset and the total number of training samples. The default value of 0.1 results in a new dataset with 10% of the training samples.
copyImages (Boolean) – If True, the images are copied from the training images directory to the new directory “ILSVRC2012Tune224”. This is useful when access to the whole dataset is not required for fine-tuning after compression. If this is False, the images are not copied and the new dataset reuses the original images in the training dataset.
Note
This method can be used when consistent results are required. The same dataset samples are used every time the fine-tuning algorithm uses the dataset created by this method.
- classmethod makeDatasets(dsNames='Train,Test,Valid', batchSize=256, dataPath=None, preProcessing='Crop256Cafe', numWorkers=8)
This class method creates several datasets, as specified by the dsNames parameter, in one shot.
- Parameters:
dsNames (str) –
A combination of the following:
”Train”: Create the training dataset.
”Test”: Create the test dataset.
”Valid”: Create the validation dataset. The ratio of validation samples can be specified using a % sign followed by the percentage. For example, “Valid%10” means create a validation dataset with 10% of the training data.
If “Valid” is included in dsNames, it must be after the “Train” or “Tune”.
If “Tune” is included instead of “Train”, then the validation samples are taken from the fine-tuning samples. See the examples below.
”Tune”: Create the fine-tuning dataset. The ratio of fine-tuning samples can be specified using a % sign followed by the percentage. For example, “Tune%5” means create a fine-tuning dataset with 5% of the training data.
If a percentage is not specified and a Tuning dataset (created by createFineTuneDataset function) is available in the dataPath directory, the fine-tuning samples are loaded from the existing Tuning dataset.
batchSize (int) – The batchSize used for all the datasets created.
dataPath (str) – The path to the directory where the dataset files are located.
preProcessing (str) – The type of preprocessing used when loading the images. Please refer to the “Pre-Processing” section above in this module’s documentation for an explanation for each one of the pre-processing methods.
numWorkers (int) – The number of worker threads used to load the images.
- Returns:
Depending on the number of items specified in the dsNames, it returns between one and three ImageNetDSet objects. The returned values have the same order as they appear in the dsNames parameter.
- Return type:
Up to 3 ImageNetDSet objects
Notes
To specify the training dataset, any string containing the word “train” (case insensitive) is accepted. So, “Training”, “TRAIN”, and ‘train’ all can be used.
To specify the test dataset, any string containing the word “test” (case insensitive) is accepted. So, “testing”, “TEST”, and ‘test’ all can be used.
To specify the validation dataset, any string containing the word “valid” (case insensitive) is accepted. So, “Validation”, “VALID”, and ‘valid’ all can be used.
To specify the fine-tuning dataset, any string containing the word “tun” (case insensitive) is accepted. So, “Fine-Tuning”, “Tuning”, and ‘tune’ all can be used.
Examples
dsNames="Train,Test,Valid%5": 3 ImageNetDSet objects are returned for training, test, and validation, in that order. The validation dataset contains 5% of the training data and the training dataset contains the remaining 95%.
dsNames="Train,Test": 2 ImageNetDSet objects are returned for training and test.
dsNames="FineTuning%5,Test": 2 ImageNetDSet objects are returned for fine-tuning and test. The fine-tuning dataset contains 5% of the training data.
dsNames="Tune%5,Test,Validation%5": 3 ImageNetDSet objects are returned for fine-tuning, test, and validation, in that order. The fine-tuning and validation datasets together contain 5% of the training data. The validation dataset contains 5% of that (5% of 5%, or 0.0025 of the training data) and the fine-tuning dataset contains the remaining 95% (95% of 5%, or 0.0475 of the training data).
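A usage sketch based on the examples above:

```python
from fireball.datasets.imagenet import ImageNetDSet

# Fine-tuning and test datasets with PyTorch-style preprocessing.
tuneDs, testDs = ImageNetDSet.makeDatasets('Tune%5,Test', batchSize=256,
                                           preProcessing='Crop256PyTorch')
```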
- getBatch(batchIndexes)
This method returns a batch of samples and labels from the dataset as specified by the list of indexes in the batchIndexes parameter.
- Parameters:
batchIndexes (list of int) – A list of indexes used to pick samples and labels from the dataset.
- Returns:
samples (numpy array) – The batch samples specified by the batchIndexes. Each sample is a resized, pre-processed image as a numpy array of shape (224, 224, 3).
labels (numpy array) – The batch labels specified by the batchIndexes. A numpy array of integer values.
COCO
This module contains the implementation of the COCO dataset class for object detection. Use the CocoDSetUnitTest.py file in the UnitTest/Datasets folder to run the unit test of this implementation.
A sample, when provided by getBatch, is a tuple of the form (image, classes, boxes). Internally, it is kept as a tuple of the form (imageId, classes, boxes, areas, crowdFlags).
imageId: The unique id of the image (int32).
image: A numpy float32 tensor of shape (h, w, 3).
classes: A numpy int32 tensor of shape (n,), where n is the number of objects in the image. Each number is the class index of the object. Class indexes start at ‘1’ and end at ‘80’. Class ‘0’ is reserved for the background.
boxes: A numpy float32 tensor of shape (n,4), where n is the number of objects in the image. The 4 numbers for each box are x1, y1, w, and h (“P1Size” format).
areas: A numpy float32 tensor of shape (n,), where n is the number of objects in the image. Each number gives the area of the object. (IMPORTANT NOTE: This is different from, and usually smaller than, the area of the box.)
crowdFlags: A numpy boolean tensor of shape (n,). Each element indicates whether the object is in a crowd of other overlapping objects.
A batch of samples is a list of samples as defined above.
This implementation assumes the following files/folders exist in the dataPath directory:
annotations: The JSON files containing the annotations (information about images and the objects inside each image).
train2014: Training images (2014).
val2014: Validation images (2014).
val2017: Validation images (2017).
- Dataset Stats

| Dataset   | Total Images | Total Objects | Crowd Objects | Crowd Images | Images with no Objects | Max Objects Per Image |
|-----------|--------------|---------------|---------------|--------------|------------------------|-----------------------|
| train2014 | 82,783       | 604,906       | 7,038         | 6,395        | 702                    | 93                    |
| val2014   | 40,504       | 291,874       | 3,460         | 3,131        | 367                    | 70                    |
| val2017   | 5,000        | 36,781        | 446           | 411          | 48                     | 63                    |
- class fireball.datasets.coco.CocoDSet(dsName='Train', dataPath=None, batchSize=64, resolution=512, keepAr=True, numWorkers=4)
This class implements the COCO dataset.
Constructs a CocoDSet instance. This can be called directly or via the makeDatasets class method.
- Parameters:
dsName (str) – The name of the dataset. It can be one of “Train”, “Test”, “Valid”.
dataPath (str) – The path to the directory where the dataset files are located.
batchSize (int) – The default batch size used in the “batches” method.
resolution (int) – The resolution of the images. The default is 512, for 512x512 images.
keepAr (Boolean) – This specifies whether the aspect ratio of the image should be kept when it is resized.
numWorkers (int) – The number of worker threads used to load the images.
- numClasses = None
- __init__(dsName='Train', dataPath=None, batchSize=64, resolution=512, keepAr=True, numWorkers=4)
Constructs a CocoDSet instance. This can be called directly or via the makeDatasets class method. The parameters are the same as those of the class, described above.
- classmethod download(dataFolder=None)
This class method can be called to download the COCO dataset files.
- Parameters:
dataFolder (str) – The folder where dataset files are saved. If this is not provided, then a folder named “data” is created in the home directory of the current user and the dataset folders and files are created there. In other words, the default data folder is ~/data.
- classmethod makeDatasets(dsNames='Train,Test,Valid', batchSize=64, dataPath=None, resolution=512, keepAr=True, numWorkers=4)
This class method creates several datasets, as specified by the dsNames parameter, in one shot.
- Parameters:
dsNames (str) –
A combination of the following:
”Train”: Create the training dataset. Training dataset uses the images in the “train2014” folder.
”Test”: Create the test dataset. Test dataset uses the images in the “val2017” folder.
”Valid”: Create the validation dataset. Validation dataset uses the images in the “val2014” folder.
batchSize (int) – The batchSize used for all the datasets created.
dataPath (str) – The path to the directory where the dataset files are located.
resolution (int) – The resolution of the images. The default is 512, for 512x512 images.
keepAr (Boolean) – This specifies whether the aspect ratio of the image should be kept when it is resized.
numWorkers (int) – The number of worker threads used to load the images.
- Returns:
Depending on the number of items specified in the dsNames, it returns between one and three CocoDSet objects. The returned values have the same order as they appear in the dsNames parameter.
- Return type:
Up to 3 CocoDSet objects
Note
To specify the training dataset, any string containing the word “train” (case insensitive) is accepted. So, “Training”, “TRAIN”, and ‘train’ all can be used.
To specify the test dataset, any string containing the word “test” (case insensitive) is accepted. So, “testing”, “TEST”, and ‘test’ all can be used.
To specify the validation dataset, any string containing the word “valid” (case insensitive) is accepted. So, “Validation”, “VALID”, and ‘valid’ all can be used.
Examples
dsNames="Train,Test,Valid": 3 CocoDSet objects are returned for training, test, and validation, in that order.
dsNames="TRAINING,TEST": 2 CocoDSet objects are returned for training and test.
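A usage sketch based on the examples above:

```python
from fireball.datasets.coco import CocoDSet

# Training and test datasets with 512x512 images.
trainDs, testDs = CocoDSet.makeDatasets('Train,Test', batchSize=32,
                                        resolution=512)
```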
- getStats()
Returns some statistics about this instance of CocoDSet.
- Returns:
sampleCounts (numpy array) – A 2-D numpy array. sampleCounts[c][1] is the number of images in the whole dataset that contain an instance of class ‘c’. sampleCounts[c][0] is the total number of times instances of class ‘c’ appear in all images in the dataset.
numCrowd (list) – A list containing 2 integers. numCrowd[0] is the total number of times a “Crowd” object appears in the whole dataset. numCrowd[1] is the number of images in the dataset that contain at least one “Crowd” object.
numEmptyImages (int) – The number of images in the dataset that don’t contain any objects.
maxObjectsPerImage (int) – The maximum number of objects appearing in a single image in the whole dataset.
- classmethod printStats(trainDs=None, testDs=None, validDs=None)
This class method prints statistics of classes for the given set of datasets in a single table.
- getImage(img)
This returns an image in BGR format as a numpy array of type float32 and shape (h,w,3).
- Parameters:
img (numpy array, int/np.int32, or str) –
If this is a numpy array, it is assumed that the image has already been loaded and it is just returned without any modifications.
If this is an int/np.int32, it is assumed to be the id of the image and it is used to get the image file name and then load the image from the file.
If this is a str, then it is assumed to be the name of the file and it is used to load the image.
- Returns:
The loaded image is returned in BGR format as a numpy array of type float32 and shape (h,w,3). Where ‘w’ and ‘h’ are equal to the ‘resolution’ argument in the ‘__init__’ or ‘makeDatasets’ functions.
- Return type:
numpy array
- classmethod p1P2ToP1Size(boxes)
Convert from [x1, y1, x2, y2] to [x1, y1, w, h]
This class method changes all the boxes in the “boxes” array from “P1P2” format to “P1Size” format.
- Parameters:
boxes (1-D or 2D list or numpy array of ints or floats) – This contains one or more boxes in “P1P2” format. In “P1P2” format each box is represented with array of 4 numbers [x1, y1, x2, y2] where (x1,y1) and (x2,y2) represent the top-left and bottom-right corners of the box correspondingly.
- Returns:
The box(es) in the “P1Size” format. In “P1Size” format each box is represented with an array of 4 numbers [x1, y1, w, h] where (x1,y1) and (w,h) represent the top-left corner of the box and size of the box correspondingly.
- Return type:
same shape and type of the input
- classmethod p1SizeToP1P2(boxes)
Convert from [x1, y1, w, h] to [x1, y1, x2, y2]
This class method changes all the boxes in the “boxes” array from “P1Size” format to “P1P2” format.
- Parameters:
boxes (1-D or 2D list or numpy array of ints or floats) – This contains one or more boxes in “P1Size” format. In “P1Size” format each box is represented with array of 4 numbers [x1, y1, w, h] where (x1,y1) and (w,h) represent the top-left corner of the box and size of the box correspondingly.
- Returns:
The box(es) in the “P1P2” format. In “P1P2” format each box is represented with an array of 4 numbers [x1, y1, x2, y2] where (x1,y1) and (x2,y2) represent the top-left and bottom-right corners of the box correspondingly.
- Return type:
same shape and type of the input
- classmethod p1P2ToCenterSize(boxes)
Convert from [x1, y1, x2, y2] to [cx, cy, w, h]
This class method changes all the boxes in the “boxes” array from “P1P2” format to “CenterSize” format.
- Parameters:
boxes (1-D or 2D list or numpy array of ints or floats) – This contains one or more boxes in “P1P2” format. In “P1P2” format each box is represented with array of 4 numbers [x1, y1, x2, y2] where (x1,y1) and (x2,y2) represent the top-left and bottom-right corners of the box correspondingly.
- Returns:
The box(es) in the “CenterSize” format. In “CenterSize” format each box is represented with an array of 4 numbers [cx, cy, w, h] where (cx,cy) and (w,h) represent the center point and size of the box correspondingly.
- Return type:
same shape and type of the input
- classmethod centerSizeToP1P2(boxes)
Convert from [cx, cy, w, h] to [x1, y1, x2, y2]
This class method changes all the boxes in the “boxes” array from “CenterSize” format to “P1P2” format.
- Parameters:
boxes (1-D or 2D list or numpy array of ints or floats) – This contains one or more boxes in “CenterSize” format. In “CenterSize” format each box is represented with array of 4 numbers [cx, cy, w, h] where (cx,cy) and (w,h) represent the center point and size of the box correspondingly.
- Returns:
The box(es) in the “P1P2” format. In “P1P2” format each box is represented with an array of 4 numbers [x1, y1, x2, y2] where (x1,y1) and (x2,y2) represent the top-left and bottom-right corners of the box correspondingly.
- Return type:
same shape and type of the input
- classmethod p1SizeToCenterSize(boxes)
Convert from [x1, y1, w, h] to [cx, cy, w, h]
This class method changes all the boxes in the “boxes” array from “P1Size” format to “CenterSize” format.
- Parameters:
boxes (1-D or 2D list or numpy array of ints or floats) – This contains one or more boxes in “P1Size” format. In “P1Size” format each box is represented with array of 4 numbers [x1, y1, w, h] where (x1,y1) and (w,h) represent the top-left corner and size of the box correspondingly.
- Returns:
The box(es) in the “CenterSize” format. In “CenterSize” format each box is represented with an array of 4 numbers [cx, cy, w, h] where (cx,cy) and (w,h) represent the center point and size of the box correspondingly.
- Return type:
same shape and type of the input
- classmethod centerSizeToP1Size(boxes)
Convert from [cx, cy, w, h] to [x1, y1, w, h]
This class method changes all the boxes in the “boxes” array from “CenterSize” format to “P1Size” format.
- Parameters:
boxes (1-D or 2D list or numpy array of ints or floats) – This contains one or more boxes in “CenterSize” format. In “CenterSize” format each box is represented with array of 4 numbers [cx, cy, w, h] where (cx,cy) and (w,h) represent the center point and size of the box correspondingly.
- Returns:
The box(es) in the “P1Size” format. In “P1Size” format each box is represented with an array of 4 numbers [x1, y1, w, h] where (x1,y1) and (w,h) represent the top-left corner of the box and size of the box correspondingly.
- Return type:
same shape and type of the input
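As a quick sketch of how these conversion methods relate to each other (the import path fireball.datasets.coco is an assumption; the arithmetic follows the format definitions above):

import numpy as np
from fireball.datasets.coco import CocoDSet   # assumed module path

p1p2 = np.array([10, 20, 50, 80])             # top-left (10,20), bottom-right (50,80)
p1Size = CocoDSet.p1P2ToP1Size(p1p2)          # -> [10, 20, 40, 60] (w=50-10, h=80-20)
centerSize = CocoDSet.p1P2ToCenterSize(p1p2)  # -> [30, 50, 40, 60] (cx=(10+50)/2, cy=(20+80)/2)

# The conversions are inverses of each other:
assert np.all(CocoDSet.p1SizeToP1P2(p1Size) == p1p2)
assert np.all(CocoDSet.centerSizeToP1P2(centerSize) == p1p2)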
- classmethod scaleImageAndBoxes(img, boxes=None, res=512, keepAr=True, boxFormat='P1Size')
This class method scales the specified image to a square res x res image. If keepAr is true, the aspect ratio is kept by padding the smaller dimension with zeros (black). It then scales all boxes specified in the boxes using the same ratio used to scale the image.
- Parameters:
img (numpy array) – The image as a numpy array of shape (h,w,3)
boxes (numpy array or None) – A set of bounding boxes for the objects present in the image. The boxes are stored using the format specified by boxFormat. It can be None or empty which indicates there are no boxes to be scaled.
res (int) – The target resolution to scale to. The returned value is a square res x res image.
keepAr (Boolean) – This specifies whether the aspect ratio of the image should be kept when it is scaled.
boxFormat (str) – This specifies the format of the boxes. See the box formats in the Notes section below.
- Returns:
resizedImg (numpy array) – The scaled image as a numpy array of shape (res,res,3)
modifiedBoxes (numpy array or None) – The scaled boxes as a numpy array (Same shape and type as boxes) or None if boxes is None.
imgSize (tuple) – The size of original image as a 2-tuple (w,h).
Note
The boxFormat specifies the format of the boxes. It can be one of the following:
P1Size: The boxes are [x1, y1, w, h] with (x1,y1) and (w,h) as top-left corner and size of the box correspondingly.
CenterSize: The boxes are [cx, cy, w, h] with (cx,cy) and (w,h) as center point and size of the box correspondingly.
P1P2: The boxes are [x1, y1, x2, y2] with (x1,y1) and (x2,y2) as top-left and bottom-right corners of the boxes correspondingly.
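A minimal usage sketch with a dummy image (the import path is an assumption; the expected outputs follow the documented return values):

import numpy as np
from fireball.datasets.coco import CocoDSet   # assumed module path

img = np.zeros((480, 640, 3), dtype=np.float32)   # a dummy 480x640 BGR image
boxes = np.array([[100, 50, 200, 150]])           # one "P1Size" box: x1, y1, w, h

# Scale to 512x512, padding the shorter dimension to keep the aspect ratio:
resizedImg, scaledBoxes, imgSize = CocoDSet.scaleImageAndBoxes(
    img, boxes, res=512, keepAr=True, boxFormat='P1Size')

print(resizedImg.shape)   # (512, 512, 3)
print(imgSize)            # (640, 480) -- the original (w, h)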
- classmethod flipImageHorizontally(img, boxes, boxFormat='P1Size')
This class method flips an image horizontally. It then moves all the boxes specified in the boxes so that they bound the original objects in the flipped images.
- Parameters:
img (numpy array) – The image as a numpy array of shape (h,w,3)
boxes (numpy array or None) – A set of bounding boxes for the objects present in the image. The boxes are stored using the format specified by boxFormat. It can be None or empty which indicates there are no boxes to be flipped.
boxFormat (str) – This specifies the format of the boxes. See the box formats in the Notes section below.
- Returns:
flippedImg (numpy array) – The flipped image as a numpy array (same shape as the input image)
flippedBoxes (numpy array or None) – The flipped boxes as a numpy array (Same shape and type as boxes) or None if boxes is None.
Note
The boxFormat specifies the format of the boxes. It can be one of the following:
P1Size: The boxes are [x1, y1, w, h] with (x1,y1) and (w,h) as top-left corner and size of the box correspondingly.
CenterSize: The boxes are [cx, cy, w, h] with (cx,cy) and (w,h) as center point and size of the box correspondingly.
P1P2: The boxes are [x1, y1, x2, y2] with (x1,y1) and (x2,y2) as top-left and bottom-right corners of the boxes correspondingly.
- classmethod zoomOutAndMove(img, boxes, newRes, res, offset=(0, 0), boxFormat='P1Size')
This class method scales the image down and then moves the smaller image by the specified offset. The return value is a black res x res image that contains a newRes x newRes image at a location specified by offset (relative to the center of the image).
It then scales and moves all the boxes specified in the boxes so that they bound the original objects in the resized/moved images.
- Parameters:
img (numpy array) – The image as a numpy array of shape (h,w,3)
boxes (numpy array or None) – A set of bounding boxes for the objects present in the image. The boxes are stored using the format specified by boxFormat. It can be None or empty which indicates there are no boxes to be scaled/moved.
newRes (int) – The new resolution of the image. The returned image contains a newRes x newRes image.
res (int) – The resolution of the returned image.
offset (tuple) – This 2-tuple specifies the offset of the scaled image from the center point of the image.
boxFormat (str) – This specifies the format of the boxes. See the box formats in the Notes section below.
- Returns:
movedImg (numpy array) – The scaled/moved image as a numpy array of shape (res,res,3)
movedBoxes (numpy array or None) – The scaled/moved boxes as a numpy array (Same shape and type as boxes) or None if boxes is None.
Note
The boxFormat specifies the format of the boxes. It can be one of the following:
P1Size: The boxes are [x1, y1, w, h] with (x1,y1) and (w,h) as top-left corner and size of the box correspondingly.
CenterSize: The boxes are [cx, cy, w, h] with (cx,cy) and (w,h) as center point and size of the box correspondingly.
P1P2: The boxes are [x1, y1, x2, y2] with (x1,y1) and (x2,y2) as top-left and bottom-right corners of the boxes correspondingly.
- setAcnchorBoxes(anchorBoxes)
Sets the anchor boxes for this dataset. The anchor boxes are created by an SSD model and passed to this class so that they can be used during training and evaluation of the models.
- Parameters:
anchorBoxes (numpy array) – An nx4 numpy array, where n is the number of anchor boxes. The boxes are in “CenterSize” format [cx, cy, w, h] with all values normalized to an image size of 1.0 x 1.0.
- getGroundTruth(labels, boxes)
This function receives lists of objects and their locations on an image and creates the ground-truth information used for training a model. The ground-truth information includes the label and location information for each one of the anchor boxes defined by the model.
- Parameters:
labels (numpy array) – This 1-D array contains the class of each object in an image.
boxes (numpy array) – This 2-D matrix contains the box information for each object in an image. The boxes are in “P1Size” format. The number of boxes in this array should match the number of labels in the labels parameter.
- Returns:
gtLabels (numpy array) – The label for each anchor box. The number of items in the array is equal to the number of anchor boxes defined by the model. (See setAcnchorBoxes)
gtBoxAdj (numpy array) – This is the adjustment applied to each anchor box to match it to one of the ground-truth boxes in the image. Shape: (numAnchors, 4)
gtMask (numpy array) – A foreground/background indicator for each anchor box: 1 -> Foreground, -1 -> Background, 0 -> Neutral
gtIous (numpy array) – This is a 1-D array of IOU values for each anchor box. For the i’th anchor box, this function finds the box in the boxes array that has the highest IOU with the anchor box. It then sets gtIous[i] to this maximum value.
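A minimal sketch of the training-time flow described above; the anchor values here are made up for illustration (in practice they come from an SSD model via setAcnchorBoxes), and the normalization of the ground-truth box is an assumption:

import numpy as np

anchors = np.array([[0.25, 0.25, 0.5, 0.5],    # made-up normalized "CenterSize" anchors
                    [0.75, 0.75, 0.5, 0.5]], dtype=np.float32)
trainDs.setAcnchorBoxes(anchors)               # trainDs: a CocoDSet instance

labels = np.array([5])                                       # one object of class 5
boxes = np.array([[0.1, 0.1, 0.3, 0.3]], dtype=np.float32)   # its "P1Size" box (normalized here for illustration)

gtLabels, gtBoxAdj, gtMask, gtIous = trainDs.getGroundTruth(labels, boxes)
print(gtLabels.shape, gtBoxAdj.shape)          # (numAnchors,) (numAnchors, 4)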
- classmethod showImageAndBoxes(image, boxes, labels, title=None)
This class method shows the specified image and the bounding boxes in a matplotlib.pyplot diagram.
- Parameters:
image (numpy array) – The image as a numpy array of shape (height, width, 3) in BGR format.
boxes (numpy array or list) – Each item in boxes represents a single box in “P1Size” format [x,y,w,h].
labels (numpy array or list) – The class of each object in the image. This is used to look up the class name and show it as caption for the object on the image. The number of labels should match the number of boxes.
title (str) – The title used for the displayed image.
Note
This function blocks current thread until the user closes the image window manually.
- showSample(sampleIndex=None, sampleId=None, title=None)
This function shows one of the samples in this dataset as specified by the sampleIndex or sampleId.
- Parameters:
sampleIndex (int or None) – The sample index. This index is used to get the specified sample in the dataset.
sampleId (int or None) – The sample identifier. This is used to find the sample index. The sample index can then be used to get the specified sample in the dataset.
title (str) – The title used for the displayed image.
Note
If sampleIndex is specified sampleId is ignored, otherwise sampleId must be specified.
This function blocks current thread until the user closes the image window manually.
- classmethod showInferResults(image, boxes=[], labels=[], scores=[], arKept=True, title=None)
This function shows the results of inference. First the image is sent to the model to detect all the objects in the image. The detected information includes bounding boxes, labels, and scores (or confidence factors) for each detected object. This information is then passed to this function to display the image together with the detected objects.
- Parameters:
image (numpy array) – The image as a numpy array of shape (height, width, 3) in BGR format.
boxes (numpy array or list) – The detected boxes in P1Size format with normalized (between 0 and 1) coordinates. The number of labels, scores, and boxes, should match.
labels (numpy array or list) – The class of each detected object in the image. This is used to look up the class name and show it as caption for each detected object in the image. The number of labels, scores, and boxes, should match.
scores (numpy array or list) – The score (or confidence) for each detected object in the image. The number of labels, scores, and boxes, should match.
arKept (Boolean) – If True, it means the aspect ratio of the image was kept when it was fed to the model for inference.
title (str) – The title used for the displayed image.
Note
This function blocks current thread until the user closes the image window manually.
- randomMutate(img, boxes)
This function randomly mutates the specified image. It can be used for data augmentation to improve the training of the model.
Currently 2 types of mutations are supported:
Horizontal Flip (25% probability)
Zoom out and move (25% probability)
The remaining 50% of the time, the original image is returned without any modifications. Please refer to the flipImageHorizontally and zoomOutAndMove functions for more detail about these mutations; a sketch of this selection logic appears after the Returns section below.
- Parameters:
img (numpy array) – The image as a numpy array of shape (height, width, 3) in BGR format.
boxes (numpy array or list) – Each item in boxes represents a single box in “P1Size” format [x,y,w,h].
- Returns:
mutatedImg (numpy array) – The modified image as a numpy array (same shape and format as the original image)
mutatedBoxes (numpy array) – The modified boxes as a numpy array (Same shape and type as boxes).
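A minimal sketch of the mutation-selection logic described above (this is not Fireball’s actual implementation; the import path and the zoom-out parameter ranges are assumptions):

import numpy as np
from fireball.datasets.coco import CocoDSet   # assumed module path

def sketchRandomMutate(img, boxes):
    r = np.random.random()
    if r < 0.5:
        return img, boxes                                   # 50%: unchanged
    if r < 0.75:
        return CocoDSet.flipImageHorizontally(img, boxes)   # 25%: horizontal flip
    # 25%: zoom out to a smaller square placed at a random offset (ranges assumed)
    res = img.shape[0]
    newRes = np.random.randint(res // 2, res)
    offset = tuple(np.random.randint(-res // 8, res // 8, 2))
    return CocoDSet.zoomOutAndMove(img, boxes, newRes, res, offset)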
- getBatch(batchIndexes)
This method returns a batch of samples and labels from the dataset as specified by the list of indexes in the batchIndexes parameter.
If the anchor boxes are already given to this dataset (See setAcnchorBoxes) then it is assumed we are training a model. In this case this function provides the pre-processed image together with ground-truth information (see getGroundTruth).
For the training case, you can enable data augmentation using the augment property. By default this is set to False.
If the anchor boxes are not available, it is assumed we are in inference mode. In this case the pre-processed images are returned together with a list of tuples (imageId, imageSize).
- Parameters:
batchIndexes (list of int) – A list of indexes used to pick samples from the dataset.
- Returns:
images (numpy array) – The batch images specified by the batchIndexes. Each image is resized, pre-processed, and possibly mutated and returned as a numpy array.
labels (list of 2-tuples or a 3-tuple of numpy arrays) –
If training, a 3-tuple is used as “labels” for the batch. The tuple contains the following items:
Ground-truth labels
Ground-truth box adjustments
Ground-truth background masks
If inferring, a list of tuples (imageId, imageSize).
- evaluate(results, isTraining=False, quiet=False)
This function receives the results of inference and evaluates them. It then returns the result statistics together with a human-readable text string containing the details of the evaluation results in the form of a table.
- Parameters:
results (list of tuples) –
The results is a list of tuples [ (IMG0, DT0), (IMG1, DT1), … ] where:
IMGi : The i’th image id.
DTi : The predicted information for the i’th image. It is a 3-tuple containing numpy arrays: ([CLASS0, CLASS1, …], [BOX0, BOX1, …], [SCORE0, SCORE1, …])
CLASSj : The class index (between 1 and numClasses - 0/background should not appear here).
BOXj : The predicted bounding box for the j’th object detected in the i’th image in P1P2 format.
SCOREj : The score (or confidence) for the j’th detected object.
The classes, boxes, and scores are sorted based on the descending order of scores. (The best predictions appear first)
isTraining (Boolean) – True means this function is called during the training (at the end of each epoch) to show the intermediate results during the training. In this case the results are calculated only for a single combination of area/maxDet.
quiet (Boolean) – If False, this function shows the progress during the calculations and prints the details of the results in the form of a table. If True, this function does not print anything during the process.
- Returns:
ap50 (numpy array) – A numpy array containing the average precision for all combinations of area/maxDet calculated with IOU threshold of .50.
ap75 (numpy array) – A numpy array containing the average precision for all combinations of area/maxDet calculated with IOU threshold of .75.
ap (numpy array) – A numpy array containing the average precision for all combinations of area/maxDet averaged over IOU threshold values 0.50, 0.55, … 0.95.
ar (numpy array) – A numpy array containing the average recall for all combinations of area/maxDet averaged over IOU threshold values 0.50, 0.55, … 0.95.
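A sketch of building the results structure described above; the image id, classes, boxes, and scores below are made up for illustration:

import numpy as np

results = [(42,                                  # image id
            (np.array([3, 1]),                   # predicted classes (1..numClasses)
             np.array([[0.1, 0.2, 0.4, 0.5],     # predicted boxes in P1P2 format
                       [0.3, 0.3, 0.9, 0.8]]),
             np.array([0.95, 0.80])))]           # scores in descending order
ap50, ap75, ap, ar = testDs.evaluate(results)    # testDs: a CocoDSet instance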
- evaluateModel(model, batchSize=None, quiet=False, returnMetric=False, **kwargs)
This function evaluates the specified model using this dataset.
- Parameters:
model (Fireball Model object) – The model being evaluated.
batchSize (int) – The batchSize used for the evaluation process. This function processes one batch of the samples at a time. If this is None, the batch size specified for this dataset object in the __init__ function is used instead.
quiet (Boolean) – If true, no messages are printed to the “stdout” during the evaluation process.
returnMetric (Boolean) –
If true, instead of calculating all the results, just calculates the main metric of the dataset and returns that. This is mostly used during the training at the end of each epoch.
Otherwise, if this is False (the default), the full results are calculated and a dictionary of all results is returned.
**kwargs (dict) –
This contains some additional task specific arguments. Here is a list of what can be included in this dictionary.
maxSamples (int): The max number of samples from this dataSet to be processed for the evaluation of the model. If not specified, all samples are used (default behavior).
- Returns:
If returnMetric is True, the actual value of dataset’s main metric (mAP) is returned.
Otherwise, this function returns a dictionary containing the results of the evaluation process.
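For example, a minimal sketch (assuming model is a trained Fireball object-detection model and testDs is a CocoDSet instance):

# Full evaluation returning a dictionary of results, limited to 1000 samples:
results = testDs.evaluateModel(model, batchSize=16, maxSamples=1000)

# Or just the main metric (mAP), as used at the end of each training epoch:
mAP = testDs.evaluateModel(model, returnMetric=True, quiet=True)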
SQuAD
This module contains the implementation of the SQuAD dataset class for NLP Question-Answering tasks. Use the SquadDSetUnitTest.py file in the UnitTest/Datasets folder to run the Unit Test of this implementation.
This implementation assumes that the following files exist in the ‘dataPath’ directory:
train-v1.1.json: The training dataset for SQuAD version 1
dev-v1.1.json: The evaluation dataset for SQuAD version 1
train-v2.0.json: The training dataset for SQuAD version 2
dev-v2.0.json: The evaluation dataset for SQuAD version 2
Note
When training, the actual answer text is not used. Only start and end tokens are included in the labels.
When evaluating, the start and end positions are not used and the answer text is used to compare with the predicted answer.
If there is a vocab.txt file in the ‘dataPath’ directory and no tokenizer is specified, this file is used to create a tokenizer.
The first time each JSON file is read, the tokenized information is saved in FNJ files. The FNJ files are used every time after that, which makes the loading process faster.
While reading the JSON files, some samples are dropped for one of the following reasons:
The question length is too short (less than 4 tokens)
Questions with no answers in SQuAD version 1.
Answer tokens could not be found in the context. (See the method “shouldDrop”)
Dataset Stats
Version 1:
Parameter            Training   Evaluation   Comments
Num Samples          87844      10833        Total number of samples (segmented contexts counted multiple times).
Num Questions        87599      10570        Total number of questions in the dataset. (Some questions ignored)
Num Questions Kept   87451      10570        Number of questions kept.
Num Answers          87844      35556        Total number of answers (multiple answers for the same question counted multiple times).
Num Contexts         18896      2067         Total number of context paragraphs.
Num Titles           442        48           Total number of subjects (titles).
Max Context Len      853        789          Maximum length of context paragraphs.
Max Question Len     61         38           Maximum length of questions.
Num Impossible       0          0            Number of questions with no answer.
Max Num Answers      1          6            Maximum number of answers for a question.
Num Segmented        893        183          Number of times a context paragraph was segmented because it was too long. Based on the segmentation params: maxSeqLen=384, stride=128, maxQuestionLen=64.
Version 2:
Parameter            Training   Evaluation   Comments
Num Samples          131805     12232        Total number of samples (segmented contexts counted multiple times).
Num Questions        130319     11873        Total number of questions in the original dataset. (Some questions ignored)
Num Questions Kept   130184     11873        Number of questions kept.
Num Answers          87074      20850        Total number of answers (multiple answers for the same question counted multiple times).
Num Contexts         19035      1204         Total number of context paragraphs.
Num Titles           442        35           Total number of subjects (titles).
Max Context Len      853        789          Maximum length of context paragraphs.
Max Question Len     61         38           Maximum length of questions.
Num Impossible       44731      6129         Number of questions with no answer (segmented samples counted multiple times).
Max Num Answers      1          6            Maximum number of answers for a question.
Num Segmented        1373       210          Number of times a context paragraph was segmented because it was too long. Based on the segmentation params: maxSeqLen=384, stride=128, maxQuestionLen=64.
- class fireball.datasets.squad.SquadDSet(dsName='Train', dataPath=None, batchSize=8, version=2, tokenizer=None, numWorkers=0)
This class implements the SQuAD dataset.
Constructs a SquadDSet instance. This can be called directly or via makeDatasets class method.
- Parameters:
dsName (str, optional) – The name of the dataset. It can be one of “Train” or “Test”.
dataPath (str, optional) – The path to the directory where the dataset files are located.
batchSize (int, optional) – The default batch size used in the “batches” method.
version (int, optional) – The SQuAD version of the dataset. It can be 1 or 2.
tokenizer (Tokenizer object, optional) – The tokenizer used to tokenize the text info in the dataset files.
numWorkers (int, optional) – If numWorkers is more than zero, “numWorkers” worker threads are created to process and prepare future batches in parallel.
- __init__(dsName='Train', dataPath=None, batchSize=8, version=2, tokenizer=None, numWorkers=0)
Constructs a SquadDSet instance. This can be called directly or via makeDatasets class method.
- Parameters:
dsName (str, optional) – The name of the dataset. It can be one of “Train” or “Test”.
dataPath (str, optional) – The path to the directory where the dataset files are located.
batchSize (int, optional) – The default batch size used in the “batches” method.
version (int, optional) – The SQuAD version of the dataset. It can be 1 or 2.
tokenizer (Tokenizer object, optional) – The tokenizer used to tokenize the text info in the dataset files.
numWorkers (int, optional) – If numWorkers is more than zero, “numWorkers” worker threads are created to process and prepare future batches in parallel.
- classmethod download(dataFolder=None)
This class method can be called to download the SQuAD dataset files.
- Parameters:
dataFolder (str) – The folder where dataset files are saved. If this is not provided, then a folder named “data” is created in the home directory of the current user and the dataset folders and files are created there. In other words, the default data folder is
~/data
- classmethod makeDatasets(dsNames='Train,Test', batchSize=8, dataPath=None, version=2, tokenizer=None, numWorkers=0)
This class method creates several datasets in one-shot as specified by dsNames parameter.
- Parameters:
dsNames (str) –
A combination of the following:
”Train”: Create the training dataset.
”Test”: Create the test dataset.
batchSize (int) – The batchSize used for all the datasets created.
dataPath (str) – The path to the directory where the dataset files are located.
version (int) – The version of SQuAD dataset (1 or 2)
tokenizer (Tokenizer object) – The tokenizer used by all created datasets. If this is None, and there is a “vocab.txt” file in the “dataPath”, this method tries to create a tokenizer using “vocab.txt” as its vocabulary.
- Returns:
Depending on the number of items specified in the dsNames, it returns one or two SquadDSet objects. The returned values have the same order as they appear in the dsNames parameter.
- Return type:
Up to 2 SquadDSet objects
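For example, a minimal sketch (assuming the SQuAD version 2 files, and optionally a vocab.txt file, are in the default data folder):

from fireball.datasets.squad import SquadDSet

trainDs, testDs = SquadDSet.makeDatasets('Train,Test', batchSize=8, version=2)
SquadDSet.printDsInfo(trainDs, testDs)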
- classmethod printDsInfo(trainDs=None, testDs=None)
This class method prints information about given set of datasets in a single table.
- classmethod printStats(trainDs=None, testDs=None)
This class method prints dataset statistics for the given set of datasets in a single table.
- getBatch(batchIndexes)
This method returns a batch of samples and labels from the dataset as specified by the list of indexes in the batchIndexes parameter.
- Parameters:
batchIndexes (list of int) – A list of indexes used to pick samples and labels from the dataset.
- Returns:
samples (tuple) –
batchTokenIds: 2D list of integer tokenId values for ‘n’ sequences (where ‘n’ is current batch size). Each sequence (row) contains ‘maxSeqLen’ tokens including tokens for the question, the context, and paddings
batchTokenTypes: 2D list of integers one for each tokenId in “batchTokenIds”. ‘0’ is used for question tokens and ‘1’ for context (or a segment of context). ‘0’ is also used for the padding tokens.
labels (tuple of 2 items) – For training mode, the label contains the following lists:
batchStartPos: For each sample in the batch, this contains the position of the first token of the ground-truth answer.
batchEndPos: For each sample in the batch, this contains the position of the last token of the ground-truth answer.
For evaluation mode, the label contains the following lists:
batchAnswerTexts: For each sample in the batch, this is a list of text strings containing the possible answers to the questions.
batchTokenInMaxContext: For each sample in the batch, this is a boolean list. Each item in the list indicates whether the corresponding token in the sequence is in its maximum context. This is used when the answer may appear in different segments of the same context. We should only consider the results when the token is in its maximum context.
- evaluateModel(model, batchSize=None, quiet=False, returnMetric=False, **kwargs)
This function evaluates the specified model using this dataset.
- Parameters:
model (Fireball Model object) – The model being evaluated.
batchSize (int) – The batchSize used for the evaluation process. This function processes one batch of the samples at a time. If this is None, the batch size specified for this dataset object in the __init__ function is used instead.
quiet (Boolean) – If true, no messages are printed to the “stdout” during the evaluation process.
returnMetric (Boolean) –
If true, instead of calculating all the results, just calculates the main metric of the dataset and returns that. This is mostly used during the training at the end of each epoch. The main metric for the SQuAD dataset is the exact match accuracy.
Otherwise, if this is False (the default), the full results are calculated and a dictionary of all results is returned.
**kwargs (dict) –
This contains some additional task specific arguments. Here is a list of what can be included in this dictionary.
maxSamples (int): The max number of samples from this dataSet to be processed for the evaluation of the model. If not specified, all samples are used (default behavior).
jsonFile (str): If specified, this is the name of JSON file that is created by this function.
- Returns:
If returnMetric is True, the actual value of dataset’s main metric is returned.
Otherwise, this function returns a dictionary containing the results of the evaluation process.
- evaluate(predictions, noAnswerProbThreshold=1.0, quiet=False)
Returns information about evaluation results based on the predicted values. This function is usually called by the evaluateModel function, but it can also be called directly when the prediction results are available as a dictionary. To evaluate prediction results stored in a JSON file, the name of the JSON file can be passed in predictions instead.
- Parameters:
predictions (dict or str) –
If predictions is a dictionary, it should contain the predicted text string for each question Id in this dataset. In other words, the keys are the question IDs in this dataset and the values are the actual answer text strings predicted by the model.
If predictions is a str, it should be the name of a JSON file containing the prediction results. The JSON file should contain a dictionary with the format explained above.
noAnswerProbThreshold (float) – If the predicted probability of not having an answer for a question is more than this value, then we assume that the prediction is no-answer. In this case we consider this an exact match if the impossible flag for this question is set in the dataset, and a mismatch otherwise.
quiet (Boolean) – If False, it prints the test results. The printed information includes the confusion matrix and accuracy information for classification problems, and MSE, RMS, MAE, and PSNR for regression problems.
- Returns:
A dictionary of the results information. Here is a list of items in the results dictionary:
exact: The exact match accuracy. A float number between 0 and 1.
f1: The F1 value which is calculated based on how similar the predicted and the ground-truth answer texts are.
numQuestions: Total number of questions involved in the evaluation
hasAnsExact: The exact match accuracy for the questions that actually have an answer (Not impossible)
hasAnsF1: The F1 value for the questions that actually have an answer (Not impossible)
numHasAns: The number of questions that have an answer (Not Impossible) involved in the evaluation
noAnsExact: The exact match accuracy for the impossible questions.
noAnsF1: The F1 value for the impossible questions.
numNoAns: The number of impossible questions involved in the evaluation
csvItems: A list of evaluation metrics that will be included in the CSV file when performing a parameter search.
- Return type:
dict
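A sketch of calling evaluate() directly with a predictions dictionary; the question ids and answer strings below are made up for illustration, and representing a no-answer prediction with an empty string is an assumption:

predictions = {
    "56be4db0acb8001400a502ec": "Denver Broncos",
    "56be4db0acb8001400a502ed": "",                  # assumed no-answer prediction
}
results = testDs.evaluate(predictions, quiet=True)   # testDs: a SquadDSet instance
print(results['exact'], results['f1'])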
GLUE
This module contains the implementation of the GLUE dataset class for different NLP tasks. Use the GlueDSetUnitTest.py file in the UnitTest/Datasets folder to run the Unit Test of this implementation.
This implementation assumes that the following folders exist in the ‘dataPath’ directory, one for each of the supported GLUE tasks. For more information about GLUE tasks please refer to: https://gluebenchmark.com/tasks
CoLA: The Corpus of Linguistic Acceptability
SST-2: The Stanford Sentiment Treebank
MRPC: Microsoft Research Paraphrase Corpus
STS-B: Semantic Textual Similarity Benchmark
QQP: Quora Question Pairs
MNLI: MultiNLI (Matched and Mismatched)
QNLI: Question NLI
RTE: Recognizing Textual Entailment
WNLI: Winograd NLI
SNLI: Stanford NLI Corpus (Not officially part of GLUE)
AX: Auxiliary Task (GLUE Diagnostic Dataset)
Note
If there is a “vocab.txt” file in the ‘dataPath’ directory and no tokenizer is specified, this file is used to create a tokenizer.
While reading dataset files some of the samples may be dropped if the sequence length exceeds the ‘maxSeqLen’
Tasks
- CoLA
The Corpus of Linguistic Acceptability (Metric: Matthew’s Corr)
Class             Training Samples   Dev Samples    Test Samples
0 Unacceptable    2528 (29.56%)      322 (30.87%)   Unknown
1 Acceptable      6023 (70.44%)      721 (69.13%)   Unknown
Total             8551 (80.24%)      1043 (9.79%)   1063 (9.97%)
Max Seq. len      47                 35             38
Samples Dropped   0                  0              0
- SST-2
The Stanford Sentiment Treebank (Metric: Accuracy)
Class             Training Samples   Dev Samples    Test Samples
0 Negative        29780 (44.22%)     428 (49.08%)   Unknown
1 Positive        37569 (55.78%)     444 (50.92%)   Unknown
Total             67349 (96.16%)     872 (1.24%)    1821 (2.60%)
Max Seq. len      66                 55             64
Samples Dropped   0                  0              0
- MRPC
Microsoft Research Paraphrase Corpus (Metric: F1/Accuracy)
Class             Training Samples   Dev Samples    Test Samples
0 Irrelevant      1194 (32.55%)      129 (31.62%)   Unknown
1 Equivalent      2474 (67.45%)      279 (68.38%)   Unknown
Total             3668 (63.23%)      408 (7.03%)    1725 (29.74%)
Max Seq. len      103                86             104
Samples Dropped   0                  0              0
- STS-B
Semantic Textual Similarity Benchmark (Metric: Pearson-Spearman Corr)
Parameter         Training Samples   Dev Samples    Test Samples
Total Count       5749               1500           1379
Max Seq. len      125                87             81
Samples Dropped   0                  0              0
- QQP
Quora Question Pairs (Metric: F1/Accuracy)
NOTE: This dataset contains some invalid records in train and dev files which are ignored.
Class             Training Samples   Dev Samples      Test Samples
0 Different       229471 (63.07%)    25545 (63.18%)   Unknown
1 Duplicate       134378 (36.93%)    14885 (36.82%)   Unknown
Total             363849 (45.75%)    40430 (5.08%)    390965 (49.16%)
Max Seq. len      330                199              319
Samples Dropped   0                  0                0
- MNLI-m
MultiNLI (Matched) (Metric: Accuracy)
Class             Training Samples   Dev Samples     Test Samples
0 Contradiction   130903 (33.33%)    3213 (32.74%)   Unknown
1 Entailment      130899 (33.33%)    3479 (35.45%)   Unknown
2 Neutral         130900 (33.33%)    3123 (31.82%)   Unknown
Total             392702 (95.24%)    9815 (2.38%)    9796 (2.38%)
Max Seq. len      444                237             249
Samples Dropped   0                  0               0
- MNLI-mm
MultiNLI (Mismatched) (Metric: Accuracy)
NOTE: This task is only available for ‘dev’ and ‘test’ datasets. It should be used with the model trained for the MNLI-m task.
Class             Dev Samples     Test Samples
0 Contradiction   3240 (32.95%)   Unknown
1 Entailment      3463 (35.22%)   Unknown
2 Neutral         3129 (31.82%)   Unknown
Total             9832 (2.38%)    9847 (2.39%)
Max Seq. len      211             262
Samples Dropped   0               0
- QNLI
Question NLI (Metric: Accuracy)
Class             Training Samples   Dev Samples     Test Samples
0 NotEntailment   52366 (50.00%)     2761 (50.54%)   Unknown
1 Entailment      52372 (50.00%)     2702 (49.46%)   Unknown
Total             104738 (90.55%)    5463 (4.72%)    5463 (4.72%)
Max Seq. len      307                250             294
Samples Dropped   5                  0               0
Note: Using maxSeqLen=384, 5 training samples are dropped because they result in longer sequences.
- RTE
Recognizing Textual Entailment (Metric: Accuracy)
Class             Training Samples   Dev Samples    Test Samples
0 NotEntailment   1241 (49.84%)      131 (47.29%)   Unknown
1 Entailment      1249 (50.16%)      146 (52.71%)   Unknown
Total             2490 (43.18%)      277 (4.80%)    3000 (52.02%)
Max Seq. len      289                253            252
Samples Dropped   0                  0              0
- WNLI
Winograd NLI (Metric: Accuracy)
Class             Training Samples   Dev Samples   Test Samples
0 NotEntailment   323 (50.87%)       40 (56.34%)   Unknown
1 Entailment      312 (49.13%)       31 (43.66%)   Unknown
Total             635 (74.53%)       71 (8.33%)    146 (17.14%)
Max Seq. len      108                105           100
Samples Dropped   0                  0             0
- SNLI
Stanford NLI Corpus (Not officially part of GLUE) (Metric: Accuracy)
Class             Training Samples   Dev Samples     Test Samples
0 Contradiction   183187 (33.35%)    3278 (33.31%)   Unknown
1 Entailment      183416 (33.39%)    3329 (33.82%)   Unknown
2 Neutral         182764 (33.27%)    3235 (32.87%)   Unknown
Total             549367 (96.54%)    9842 (1.73%)    9824 (1.73%)
Max Seq. len      71                 59              36
Samples Dropped   0                  0               0
- AX
Auxiliary Task (GLUE Diagnostic Dataset)
This task is only available for the ‘test’ dataset. It is used when submitting results to the GLUE website. The labels should be predicted using the model trained for the MNLI-M task. The labels for the samples in this dataset are unknown. Here are some statistics:
Total Samples: 1104
Max Seq. len: 121
Samples Dropped: 0
- class fireball.datasets.glue.GlueDSet(taskName, dsName='Train', dataPath=None, batchSize=8, tokenizer=None, numWorkers=0)
This class implements the GLUE group of datasets.
Constructs a GlueDSet instance. This can be called directly or via the makeDatasets() class method.
- Parameters:
taskName (str) –
One of the GLUE task names. Currently the following tasks are supported:
"CoLA": The Corpus of Linguistic Acceptability
"SST-2": The Stanford Sentiment Treebank
"MRPC": Microsoft Research Paraphrase Corpus
"STS-B": Semantic Textual Similarity Benchmark
"QQP": Quora Question Pairs
"MNLI-M": MultiNLI Matched
"MNLI-MM": MultiNLI Mismatched
"QNLI": Question NLI
"RTE": Recognizing Textual Entailment
"WNLI": Winograd NLI
"SNLI": Stanford NLI Corpus
"AX": Auxiliary Task (GLUE Diagnostic Dataset)
dsName (str) – The name of the dataset. It can be one of “Train”, “Dev”, or “Test”.
dataPath (str) – The path to the directory where the dataset files are located.
batchSize (int) – The default batch size used in the “batches” method.
tokenizer (Tokenizer object) – The tokenizer used to tokenize the text info in the dataset files.
numWorkers (int) – The number of worker threads used to load the samples.
- __init__(taskName, dsName='Train', dataPath=None, batchSize=8, tokenizer=None, numWorkers=0)
Constructs a GlueDSet instance. This can be called directly or via the makeDatasets() class method.
- Parameters:
taskName (str) –
One of the GLUE task names. Currently the following tasks are supported:
"CoLA": The Corpus of Linguistic Acceptability
"SST-2": The Stanford Sentiment Treebank
"MRPC": Microsoft Research Paraphrase Corpus
"STS-B": Semantic Textual Similarity Benchmark
"QQP": Quora Question Pairs
"MNLI-M": MultiNLI Matched
"MNLI-MM": MultiNLI Mismatched
"QNLI": Question NLI
"RTE": Recognizing Textual Entailment
"WNLI": Winograd NLI
"SNLI": Stanford NLI Corpus
"AX": Auxiliary Task (GLUE Diagnostic Dataset)
dsName (str) – The name of the dataset. It can be one of “Train”, “Dev”, or “Test”.
dataPath (str) – The path to the directory where the dataset files are located.
batchSize (int) – The default batch size used in the “batches” method.
tokenizer (Tokenizer object) – The tokenizer used to tokenize the text info in the dataset files.
numWorkers (int) – The number of worker threads used to load the samples.
- classmethod download(taskName, dataFolder=None)
This class method can be called to download the GLUE dataset files.
- Parameters:
taskName (str) – The name of the task for which the dataset files will be downloaded. If set to "All", the dataset files for all GLUE tasks will be downloaded. Otherwise it should be one of the following:
"CoLA": The Corpus of Linguistic Acceptability
"SST-2": The Stanford Sentiment Treebank
"MRPC": Microsoft Research Paraphrase Corpus
"STS-B": Semantic Textual Similarity Benchmark
"QQP": Quora Question Pairs
"MNLI-M": MultiNLI Matched
"MNLI-MM": MultiNLI Mismatched
"QNLI": Question NLI
"RTE": Recognizing Textual Entailment
"WNLI": Winograd NLI
"SNLI": Stanford NLI Corpus
"AX": Auxiliary Task (GLUE Diagnostic Dataset)
dataFolder (str) – The folder where dataset files are saved. If this is not provided, then a folder named “data” is created in the home directory of the current user and the dataset folders and files are created there. In other words, the default data folder is
~/data
- classmethod makeDatasets(taskName, dsNames='Train,Dev,Test', batchSize=8, dataPath=None, tokenizer=None, numWorkers=0)
This class method creates several datasets in one-shot as specified by dsNames parameter.
- Parameters:
taskName (str) – One of the GLUE task names. Please refer to the documentation of the __init__() method above for more details about supported tasks.
dsNames (str) –
A combination of the following:
"Train": Create the training dataset.
"Dev": Create the dev dataset.
"Test": Create the test dataset. (Labels unknown)
batchSize (int) – The batchSize used for all the datasets created.
dataPath (str) – The path to the directory where the dataset files are located.
tokenizer (Tokenizer object) – The tokenizer used by all created datasets. If this is None and there is a vocab.txt file in the dataPath, this method tries to create a tokenizer using vocab.txt as its vocabulary.
numWorkers (int) – The number of worker threads used to load the samples.
- Returns:
Depending on the number of items specified in the dsNames, it returns one to three GlueDSet objects. The returned values have the same order as they appear in the dsNames parameter.
- Return type:
Up to 3 GlueDSet objects
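For example, a minimal sketch creating the MRPC datasets from the default data folder:

from fireball.datasets.glue import GlueDSet

trainDs, devDs, testDs = GlueDSet.makeDatasets('MRPC', 'Train,Dev,Test', batchSize=8)
GlueDSet.printStats(trainDs, devDs, testDs)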
- classmethod printDsInfo(trainDs=None, devDs=None, testDs=None)
This class method prints information about given set of datasets in a single table.
- Parameters:
trainDs (any object derived from BaseDSet, optional) – The training dataset.
devDs (any object derived from BaseDSet, optional) – The dev dataset.
testDs (any object derived from BaseDSet, optional) – The test dataset.
- classmethod printStats(trainDs=None, devDs=None, testDs=None)
This class method prints statistics of the given set of datasets in a single table.
- Parameters:
trainDs (any object derived from BaseDSet, optional) – The training dataset.
devDs (any object derived from BaseDSet, optional) – The dev dataset.
testDs (any object derived from BaseDSet, optional) – The test dataset. (Unknown Labels)
- getBatch(batchIndexes)
This method returns a batch of samples and labels from the dataset as specified by the list of indexes in the batchIndexes parameter.
- Parameters:
batchIndexes (list of int) – A list of indexes used to pick samples and labels from the dataset.
- Returns:
samples (tuple) –
This is a 3-tuple:
samples = (batchSampleIdxs, batchTokenIds, batchTokenTypes)
Where:
batchSampleIdxs: 1D list of sample indexes for the samples in this batch.
batchTokenIds: 2D list of integer tokenId values for ‘n’ sequences (where ‘n’ is current batch size). Each sequence (row) contains ‘maxSeqLen’ tokens including CLS, SEP, and paddings
batchTokenTypes: 2D list of integers one for each tokenId in “batchTokenIds”. ‘0’ is used for first text tokens and ‘1’ for the second text tokens. ‘0’ is also used for the padding tokens.
labels (numpy array of int32 or float32) – For classification tasks this contains the label for each one of batch samples. For regression tasks, this contains the ground-truth value for each one of the batch samples.
- evaluate(predicted, actual, topK=0, confMat=False, expAcc=None, quiet=False)
Returns information about evaluation results based on the “predicted” and “actual” values. This is usually called by the evaluateModel() method, which should be used to evaluate a model with this dataset.
- Parameters:
predicted (array) – The predicted values of the output for the test samples. This is a 1-D arrays of labels for Classification tasks or an array of output values for Regression tasks (STS-B).
actual (array) – The actual values of the output for the test samples.
topK (int) – Not used for this dataset
confMat (Boolean) – For classification cases, this indicates whether the confusion matrix should be calculated. If the number of classes is more than 10, this argument is ignored and confusion matrix is not calculated. This is ignored for regression cases.
expAcc (Boolean or None) –
Ignored for regression cases. For classification cases:
If this is True, the expected accuracy and kappa values are also calculated. When the number of classes and/or number of evaluation samples is large, calculating expected accuracy can take a long time.
If this is False, the expected accuracy and kappa are not calculated.
If this is None (the default), then the expected accuracy and kappa are calculated only if number of classes does not exceed 10.
Note: If confMat is True, then expAcc is automatically set to True.
quiet (Boolean) – If False, it prints the test results.
- Returns:
For ‘test’ datasets, since the labels are unknown, this function returns a list of tuples like (index, predictedLabel) which can be used to make the ‘tsv’ files for submission to GLUE website. Otherwise, a dictionary of evaluation result values is returned.
- Return type:
list of tuples or dict
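A sketch of turning the (index, predictedLabel) tuples returned for a ‘test’ dataset into a GLUE submission file; how predicted and actual are obtained from the model is elided here, and the file name and column layout are assumptions:

# 'predicted' is the array of predicted labels for the test samples (obtained
# from the model; elided here). 'actual' follows the evaluate() signature even
# though the test labels are unknown.
rows = testDs.evaluate(predicted, actual, quiet=True)   # list of (index, predictedLabel)
with open('MRPC.tsv', 'w') as f:                        # assumed submission file name
    f.write('index\tprediction\n')
    for index, predictedLabel in rows:
        f.write('%d\t%s\n' % (index, predictedLabel))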
RadioML
This module contains the implementation of the RadioMlDSet class that encapsulates the RadioML dataset for the Modulation Classification problem.
The directory structure should be something like this:
data/
    RadioML/
        RML2016_10b/
        RML2018_01/
Use the RadioMlDSetUnitTest.py file in the UnitTest/Datasets folder to run the Unit Test of this implementation.
The dataset contains samples of shape 128x1x2 (1024x1x2 for 2018 version) and labels which indicate one of 10 modulation types (24 for 2018 version). Each sample contains the floating point numbers captured as a time series. The time-series are the samples taken from I and Q signals. The samples were captured at 20 different SNR values (26 for 2018 version).
Note
For the 2018 version, you need to download the “RML2018_01” dataset and run the createNpzFiles() class method (only once) to extract the numpy files for each SNR value from the original “GOLD_XYZ_OSC.0001_1024.hdf5” file.
Dataset Stats (2016)
SNR Values: -20, -18, … 18 (20 values)
Classes: 0:8PSK 1:AM-DSB 2:BPSK 3:CPFSK 4:GFSK 5:PAM4 6:QAM16 7:QAM64 8:QPSK 9:WBFM
Samples per SNR per class: 6,000
Samples per SNR value: 60,000
Samples per class: 120,000
Total Samples: 1,200,000 (= 10 * 20 * 6000)
The range of values in the dataset depends on the SNR values. The range is usually larger for larger SNR values; except for very small SNR values since they mostly contain noise. Here are the ranges for a few examples:
SNR = -20: -0.030309 .. 0.032651
SNR = -10: -0.030112 .. 0.031982
SNR = 0:   -0.061123 .. 0.066305
SNR = 10:  -0.129547 .. 0.106047
SNR = 18:  -0.161878 .. 0.180627
SNR = All: -0.210572 .. 0.180627
Dataset Stats (2018)
SNR Values: -20, -18, … 28, 30 (26 values)
Classes: 0:32PSK 1:16APSK 2:32QAM 3:FM 4:GMSK 5:32APSK 6:OQPSK 7:8ASK 8:BPSK 9:8PSK 10:AM-SSB-SC 11:4ASK 12:16PSK 13:64APSK 14:128QAM 15:128APSK 16:AM-DSB-SC 17:AM-SSB-WC 18:64QAM 19:QPSK 20:256QAM 21:AM-DSB-WC 22:OOK 23:16QAM
Samples per SNR per class: 4,096
Samples per SNR value: 98,304
Samples per class: 106,496
Total Samples: 2,555,904 (= 24 * 26 * 4096)
Max Absolute Value in the whole dataset: 68.32339
The range of values in the dataset depends on the SNR values. The range is usually larger for larger SNR values; except for very small SNR values since they mostly contain noise. Here are the ranges for a few examples:
SNR = -20: -3.921285 .. 3.855273
SNR = -10: -3.941709 .. 4.169807
SNR = 0:   -4.842472 .. 4.522851
SNR = 10:  -12.566387 .. 12.534958
SNR = 20:  -49.853298 .. 50.442608
SNR = 30:  -37.581909 .. 41.837891
SNR = All: -68.323387 .. 51.562645
- class fireball.datasets.radioml.RadioMlDSet(dsName='Train', dataPath=None, samples=None, labels=None, batchSize=128, snrValues=None, version=2016, labelMode='MOD')
This class implements the RadioML dataset.
Constructs a RadioMlDSet instance. This can be called directly or via the makeDatasets() class method.
- Parameters:
dsName (str) – The name of the dataset. It can be one of “Train”, “Test”, or “Tune”. Note that “Valid” cannot be used here.
dataPath (str) – The path to the directory where the dataset files are located.
samples (numpy array or None) – If specified, it is used as the samples for the dataset. It is a numpy array of samples. Each sample is numpy array of shape (128,1,2) for the 2016 version or (1024,1,2) for 2018 version.
labels (numpy array or None) – If specified, it is a numpy array of int32 values. Each label is an int32 number indicating the class for each sample. (see the classNames above)
batchSize (int) – The default batch size used in the “batches” method.
snrValues (int, list of ints, or None) –
If it is an int, it specifies the single SNR value that must be used. The dataset will contain the data for the specified SNR value only.
If it is a list of ints, only the samples for the specified SNR values are included in the dataset.
If it is None, all samples for all SNR values are included in the dataset.
version (int) –
The version of the RadioML dataset. It can be either 2016 (the default) or 2018. The actual dataset versions used are as follows:
2016: RADIOML 2016.10B
2018: RADIOML 2018.01A
labelMode (str) –
Specifies the type of label to be returned in batches of the data.
MOD: This mode returns the modulation class as the label. (Default)
SNR: This mode returns the SNR index as the label.
BOTH: This mode returns a tuple of modulation classes and SNR indexes as labels.
- __init__(dsName='Train', dataPath=None, samples=None, labels=None, batchSize=128, snrValues=None, version=2016, labelMode='MOD')
Constructs a RadioMlDSet instance. This can be called directly or via the makeDatasets() class method.
- Parameters:
dsName (str) – The name of the dataset. It can be one of “Train”, “Test”, or “Tune”. Note that “Valid” cannot be used here.
dataPath (str) – The path to the directory where the dataset files are located.
samples (numpy array or None) – If specified, it is used as the samples for the dataset. It is a numpy array of samples. Each sample is numpy array of shape (128,1,2) for the 2016 version or (1024,1,2) for 2018 version.
labels (numpy array or None) – If specified, it is a numpy array of int32 values. Each label is an int32 number indicating the class for each sample. (see the classNames above)
batchSize (int) – The default batch size used in the “batches” method.
snrValues (int, list of ints, or None) –
If it is an int, it specifies the single SNR value that must be used. The dataset will contain the data for the specified SNR value only.
If it is a list of ints, only the samples for the specified SNR values are included in the dataset.
If it is None, all samples for all SNR values are included in the dataset.
version (int) –
The version of the RadioML dataset. It can be either 2016 (the default) or 2018. The actual dataset versions used are as follows:
2016: RADIOML 2016.10B
2018: RADIOML 2018.01A
labelMode (str) –
Specifies the type of label to be returned in batches of the data.
MOD: This mode returns the modulation class as the label. (Default)
SNR: This mode returns the SNR index as the label.
BOTH: This mode returns a tuple of modulation classes and SNR indexes as labels.
- classmethod download(dataFolder=None)
This class method can be called to download the RadioML dataset files from a Fireball online repository.
- Parameters:
dataFolder (str) – The folder where dataset files are saved. If this is not provided, then a folder named “data” is created in the home directory of the current user and the dataset folders and files are created there. In other words, the default data folder is
~/data
- classmethod configure(testRatio=0.5, tuneRatio=0.1, validRatio=0.1, version=2016)
Configures the RadioMlDSet class. If a behavior other than the default is needed, this function can be called to prepare the class before instantiating the dataset instances.
Since the RadioML dataset doesn’t have a standard split for train, validation, and test samples, you can use this function to define the splits.
- Parameters:
testRatio (float) – The ratio of test samples to the total number of samples in the dataset. The default is 0.5, which means 50% of the data is used for training and 50% for test.
tuneRatio (float) – The ratio of tuning samples to the number of training samples. The default is 0.1, which means 10% of the training samples are used for the tuning dataset.
validRatio (float) – The ratio of validation samples to the number of training samples. The default is 0.1, which means 10% of the training samples are used for the validation dataset. The remaining 90% are used as training samples. Please note that if “Valid” is not specified when makeDatasets() is called, then all training samples are used for training.
version (int) –
The version of the RadioML dataset. It can be either 2016 (the default) or 2018. The actual dataset versions used are as follows:
2016: RADIOML 2016.10B
2018: RADIOML 2018.01A
Example
For example, the following call:
RadioMlDSet.configure(testRatio=.2, validRatio=0.1)
can be used to have 20% of the samples for test, 72% for training, and 8% for validation.
To fine-tune the trained model, the following call:
RadioMlDSet.configure(testRatio=.2, tuneRatio=0.1, validRatio=0.1)
can be used to have 20% of samples for test, 7.2% for training (Fine-Tuning), and 0.8% for validation.
- classmethod makeDatasets(dsNames='Train,Test,Valid', batchSize=128, dataPath=None, snrValues=None, version=2016, labelMode='MOD')
This class method creates several datasets as specified by dsNames parameter in one-shot.
- Parameters:
dsNames (str) –
A combination of the following:
"Train": Create the training dataset.
"Test": Create the test dataset.
"Valid": Create the validation dataset. The ratio of validation samples can be specified using the configure() method before calling this function. The default is 10% of the training data. If used together with tuning data, the ratio specifies the portion of tune samples (not train samples).
"Tune": Create the fine-tuning dataset. The ratio of fine-tuning samples can be specified using the configure() method before calling this function. The default is 10% of the training data.
batchSize (int) – The batchSize used for all the datasets created.
dataPath (str) – The path to the directory where the dataset files are located.
snrValues (int, list of ints, or None) – Indicates which SNR values must be included in the dataset. Refer to the snrValues parameter of the __init__() method for more info.
version (int) –
The version of the RadioML dataset. It can be either 2016 (the default) or 2018. The actual datasets used are as follows:
2016: RADIOML 2016.10B
2018: RADIOML 2018.01A
labelMode (str) –
Specifies the type of label to be returned in batches of the data.
MOD: This mode returns the modulation class as the label. (Default)
SNR: This mode returns the SNR index as the label.
BOTH: This mode returns a tuple of modulation classes and SNR indexes as labels.
- Returns:
Depending on the number of items specified in the dsNames, it returns between one and three RadioMlDSet objects. The returned values have the same order as they appear in the dsNames parameter.
- Return type:
Up to 3 RadioMlDSet objects
Note
To specify the training dataset, any string containing the word “train” (case insensitive) is accepted. So, “Training”, “TRAIN”, and ‘train’ all can be used.
To specify the test dataset, any string containing the word “test” (case insensitive) is accepted. So, “testing”, “TEST”, and ‘test’ all can be used.
To specify the validation dataset, any string containing the word “valid” (case insensitive) is accepted. So, “Validation”, “VALID”, and ‘valid’ all can be used.
To specify the fine-tuning dataset, any string containing the word “tun” (case insensitive) is accepted. So, “Fine-Tuning”, “Tuning”, and ‘tune’ all can be used.
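For example, a minimal sketch creating 2016-version datasets restricted to the five highest SNR values (the SNR selection is just an illustration):

from fireball.datasets.radioml import RadioMlDSet

RadioMlDSet.configure(testRatio=0.2, validRatio=0.1)
trainDs, testDs, validDs = RadioMlDSet.makeDatasets('Train,Test,Valid', batchSize=128,
                                                    snrValues=[10, 12, 14, 16, 18],
                                                    version=2016)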
- classmethod printDsInfo(trainDs=None, testDs=None, validDs=None)
This class method prints information about given set of datasets in a single table.
- Parameters:
trainDs (any object derived from BaseDSet, optional) – The training dataset.
testDs (any object derived from BaseDSet, optional) – The test dataset.
validDs (any object derived from BaseDSet, optional) – The validation dataset.
- getBatch(batchIndexes)
This method returns a batch of samples and labels from the dataset as specified by the list of indexes in the batchIndexes parameter.
- Parameters:
batchIndexes (list of int) – A list of indexes used to pick samples and labels from the dataset.
- Returns:
samples (list or numpy array) – The batch samples specified by the batchIndexes.
labels (list or numpy array) – The batch labels specified by the batchIndexes.
- classmethod createNpzFiles(dataPath=None)
For the 2018 version of the RadioML dataset, this function reads the dataset information from the original dataset file GOLD_XYZ_OSC.0001_1024.hdf5 and creates an “npz” file for each SNR value. This only needs to be done once before this dataset can be used.
- Parameters:
dataPath (str) – The path to the directory where the dataset files are located.