API Reference

A tutorial-style introduction to this API is available in the Usage examples section.

class pyradigm.MLDataset(filepath=None, in_dataset=None, arff_path=None, data=None, labels=None, classes=None, description='', feature_names=None, encode_nonnumeric=False)[source]

Bases: object

An ML dataset to ease workflow and maintain integrity.


Helper to rename the classes, if provided a dict keyed in by the original keys.

Parameters:

classes : dict
Dict of class names keyed in by sample IDs.

Raises:

If classes is not a dict.
If not all samples in the dataset are present in the input dict, or if one of the samples in the input is not recognized.
add_sample(sample_id, features, label, class_id=None, overwrite=False, feature_names=None)[source]

Adds a new sample to the dataset with its features, label and class ID.

This is the preferred way to construct the dataset.

Parameters:

sample_id : str, int
The identifier that uniquely identifies this sample.
features : list, ndarray
The features for this sample.
label : int, str
The label for this sample.
class_id : int, str
The class for this sample. If not provided, the label converted to a string becomes its ID.
overwrite : bool
If True, allows the overwrite of features for an existing sample ID. Default : False.
feature_names : list
The names for each feature. Assumed to be in the same order as features.

Raises:

If sample_id is already in the MLDataset (and overwrite=False), if the dimensionality of the sample being added does not match the rest of the dataset, or if feature_names do not match the existing names.
If the sample to be added is of a different data type compared to existing samples.
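The contract above can be illustrated with a small stand-in class; this is a hypothetical sketch of the described checks, not the actual pyradigm implementation, and TinyDataset is an invented name:

```python
import numpy as np

class TinyDataset:
    """Minimal dict-backed dataset mimicking the add_sample checks described above."""

    def __init__(self):
        self._data, self._labels, self._classes = {}, {}, {}

    def add_sample(self, sample_id, features, label,
                   class_id=None, overwrite=False):
        features = np.asarray(features).flatten()
        if features.size == 0:
            raise ValueError('features must not be empty')
        # existing IDs are rejected unless overwrite=True
        if sample_id in self._data and not overwrite:
            raise ValueError('{} already exists'.format(sample_id))
        # dimensionality must match the samples already in the dataset
        if self._data:
            existing_dim = next(iter(self._data.values())).size
            if features.size != existing_dim:
                raise ValueError('dimensionality mismatch')
        self._data[sample_id] = features
        self._labels[sample_id] = label
        # label converted to a string becomes the class ID by default
        self._classes[sample_id] = str(label) if class_id is None else class_id

ds = TinyDataset()
ds.add_sample('sub001', [0.5, 1.2, 0.9], label=1)
ds.add_sample('sub002', [0.3, 0.8, 1.1], label=2, class_id='patient')
```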
classmethod check_features(features)[source]

Method to ensure data to be added is not empty and vectorized.

Parameters:

features : iterable
Any data that can be converted to a numpy array.

Returns:

features : numpy array
Flattened non-empty numpy array.

Raises:

If the input data is empty.
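A rough equivalent of this validation, written as a plain function (assumed behavior, not the pyradigm source):

```python
import numpy as np

def check_features(features):
    # Convert to ndarray, reject empty input, and flatten to 1D,
    # as described for MLDataset.check_features above.
    features = np.asarray(features)
    if features.size == 0:
        raise ValueError('input data is empty')
    return features.flatten()
```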

Set of unique classes in the dataset.


Returns the sizes of the different classes in a Counter object.


Identifiers (sample IDs, sample names, etc.) forming the basis of the dict-type MLDataset.


Data in its original dict form.


Dataset features and labels in a matrix form for learning.

Also returns sample_ids in the same order.

Returns:

data_matrix : ndarray
2D array of shape [num_samples, num_features], with features corresponding row-wise to sample_ids.
labels : ndarray
Array of numeric labels for each sample, corresponding row-wise to sample_ids.
sample_ids : list
List of sample ids.
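The dict-to-matrix conversion described above can be sketched with numpy; the sample data below is illustrative only:

```python
import numpy as np

# dict-type storage as in MLDataset (illustrative data)
data = {'sub001': [0.5, 1.2], 'sub002': [0.3, 0.8], 'sub003': [0.7, 0.4]}
labels = {'sub001': 1, 'sub002': 2, 'sub003': 1}

sample_ids = list(data.keys())
# rows of the matrix line up with sample_ids, as do the labels
data_matrix = np.vstack([data[sid] for sid in sample_ids])
label_arr = np.array([labels[sid] for sid in sample_ids])
```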

Method to remove a sample from the dataset.

Parameters:

sample_id : str
Sample id to be removed.

Raises:

If the sample id to delete was not found in the dataset.

Text description (header) that can be set by user.


Number of features in each sample.


Method to extend the dataset vertically (add samples from another dataset).

Parameters:

other : MLDataset
Second dataset to be combined with the current (different samples, but same dimensionality).

Raises:

If the input is not an MLDataset.

Returns the feature names as a numpy array of strings.

get(item, not_found_value=None)[source]

Method akin to dict.get(), which returns the specified value if the key is not found.


Returns a smaller dataset belonging to the requested classes.

Parameters:

class_id : str
Identifier of the class to be returned.

Returns:

MLDataset
With the subset of samples belonging to the given class.

Raises:

If one or more of the requested classes do not exist in this dataset, or if the specified id is empty or None.

Returns the subset of features indexed numerically.

Parameters:

subset_idx : list, ndarray
List of indices of the features to be returned.

Returns:

MLDataset : MLDataset
With the subset of features requested.

Raises:

If the input indices are out of bounds for the dataset.

Returns a smaller dataset identified by their keys/sample IDs.

Parameters:

subset_ids : list
List of sample IDs to be extracted from the dataset.

Returns:

sub-dataset : MLDataset
Sub-dataset containing only the requested sample IDs.

Quick and partial glance of the data matrix.

Parameters:

nitems : int
Number of items to glance from the dataset. Default : 5.



Sample identifiers (strings) forming the basis of MLDataset (same as sample_ids).

static keys_with_value(dictionary, value)[source]

Returns a subset of keys from the dict with the value supplied.


Set of labels in the dataset corresponding to class_set.


Returns the array of labels for all the samples.


Total number of classes in the dataset.


Number of features in each sample.


Number of samples in the entire dataset.


Returns a random sub-dataset (of specified size, by percentage) within each class.

Parameters:

perc_in_class : float
Fraction of samples to be taken from each class.

Returns:

subdataset : MLDataset
Random sub-dataset of specified size.

Returns a random subset of sample ids (of specified size, by percentage) within each class.

Parameters:

perc_per_class : float
Fraction of samples per class.

Returns:

subset : list
Combined list of sample ids from all classes.

Raises:

If no subjects from one or more classes were selected.
If an empty or full dataset is requested.

Returns a random subset of sample ids (of specified size, by count) within each class.

Parameters:

count_per_class : int
Exact number of samples per class.

Returns:

subset : list
Combined list of sample ids from all classes.
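The per-class sampling these methods describe can be sketched in plain Python; this is an assumed rendering of the semantics, not the pyradigm implementation:

```python
import random

def random_subset_ids(classes, perc_per_class):
    """classes maps sample id -> class id; draw a fraction of ids
    from each class and return the combined list."""
    if not 0.0 < perc_per_class < 1.0:
        raise ValueError('fraction must lie in the open interval (0, 1)')
    by_class = {}
    for sid, cls in classes.items():
        by_class.setdefault(cls, []).append(sid)
    subset = []
    for cls, ids in by_class.items():
        num_chosen = int(round(perc_per_class * len(ids)))
        if num_chosen < 1:
            raise ValueError('no subjects selected from class {}'.format(cls))
        subset.extend(random.sample(ids, num_chosen))
    return subset

classes = {'s1': 'ctrl', 's2': 'ctrl', 's3': 'ctrl',
           's4': 'pat', 's5': 'pat', 's6': 'pat'}
subset = random_subset_ids(classes, perc_per_class=2 / 3)
```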

Sample identifiers (strings) forming the basis of MLDataset (same as keys).


Returns a list of sample ids belonging to a given class.

Parameters:

class_id : str
Class id to query.

Returns:

subset_ids : list
List of sample ids belonging to the given class.

Method to save the dataset to disk.

Parameters:

file_path : str
File path to save the current dataset to.

Raises:

If saving to disk is not successful.

Summary of classes: names, numeric labels and sizes.

Returns:

tuple : class_set, label_set, class_sizes

class_set : list
List of names of all the classes.
label_set : list
Label for each class in class_set.
class_sizes : list
Size of each class (number of samples).
train_test_split_ids(train_perc=None, count_per_class=None)[source]

Returns two disjoint sets of sample ids for use in cross-validation.

Offers two ways to specify the sizes: fraction or count. Only one method can be used at a time.

Parameters:

train_perc : float
Fraction of samples from each class to build the training subset.
count_per_class : int
Exact count of samples from each class to build the training subset.

Returns:

train_set : list
List of ids in the training set.
test_set : list
List of ids in the test set.

Raises:

If the fraction is outside the open interval (0, 1), if the count is larger than the size of the smallest class, if an unrecognized format is provided for the input args, or if the selection results in an empty subset for either the train or test set.
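A fraction-based split with these properties can be sketched as follows; this is an illustrative rendering of the described semantics, not the pyradigm source:

```python
import random

def train_test_split_ids(classes, train_perc):
    """classes maps sample id -> class id; draw train_perc of each
    class for training, with the remainder forming the test set."""
    if not 0.0 < train_perc < 1.0:
        raise ValueError('train_perc must lie in the open interval (0, 1)')
    by_class = {}
    for sid, cls in classes.items():
        by_class.setdefault(cls, []).append(sid)
    train_set, test_set = [], []
    for cls, ids in by_class.items():
        num_train = int(round(train_perc * len(ids)))
        # guard against an empty train or test subset in any class
        if num_train < 1 or num_train >= len(ids):
            raise ValueError('split leaves train or test empty for {}'.format(cls))
        chosen = random.sample(ids, num_train)
        train_set.extend(chosen)
        test_set.extend(sid for sid in ids if sid not in chosen)
    return train_set, test_set

classes = {'s{}'.format(i): ('ctrl' if i < 5 else 'pat') for i in range(10)}
train, test = train_test_split_ids(classes, train_perc=0.6)
```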
transform(func, func_description=None)[source]

Applies a given function to the features of each subject and returns a new dataset with other info unchanged.

Parameters:

func : callable
A valid callable that takes in a single ndarray and returns a single ndarray. Ensure the transformed dimensionality is the same for all subjects.

If your function requires more than one argument, use functools.partial to freeze all the arguments except the features for the subject.

func_description : str, optional
Human readable description of the given function.

Returns:

xfm_ds : MLDataset
With features obtained from the subject-wise transform.

Raises:

If the given func is not a callable.
If the transformation of any of the subjects' features raises an exception.


Example with a function taking a single argument (get_pcg here denotes a user-defined ROI-masking function):

from pyradigm import MLDataset
import numpy as np

thickness = MLDataset(filepath='ADNI_thickness.csv')
pcg_thickness = thickness.transform(func=get_pcg, func_description='applying ROI mask for PCG')
pcg_median = pcg_thickness.transform(func=np.median, func_description='median per subject')

Complex example with a function taking more than one argument:

from pyradigm import MLDataset
from functools import partial
import numpy as np
import hiwenet

thickness = MLDataset(filepath='ADNI_thickness.csv')
roi_membership = read_roi_membership()  # user-defined loader of ROI labels
hw = partial(hiwenet.extract, groups=roi_membership)

thickness_hiwenet = thickness.transform(func=hw, func_description='histogram weighted networks')
median_thk_hiwenet = thickness_hiwenet.transform(func=np.median, func_description='median per subject')

Command line interface

The command line interface makes it possible:

  • to display basic info about datasets without having to code
  • to perform basic arithmetic (add multiple classes or feature sets)