API Reference

A tutorial-style presentation of the following API is available in the Usage examples.

class pyradigm.MLDataset(filepath=None, in_dataset=None, arff_path=None, data=None, labels=None, classes=None, description='', feature_names=None, encode_nonnumeric=False)[source]

Bases: object

An ML dataset to ease workflow and maintain integrity.

add_classes(classes)[source]

Helper to rename the classes, given a dict keyed in by the original keys.

classes : dict
Dict of class names keyed in by sample IDs.
TypeError
If classes is not a dict.
ValueError
If not all samples in the dataset are present in the input dict, or if one of the samples in the input is not recognized.
add_sample(sample_id, features, label, class_id=None, overwrite=False, feature_names=None)[source]

Adds a new sample to the dataset with its features, label and class ID.

This is the preferred way to construct the dataset.

sample_id : str, int
The identifier that uniquely identifies this sample.
features : list, ndarray
The features for this sample
label : int, str
The label for this sample
class_id : int, str
The class for this sample. If not provided, the label converted to a string becomes its ID.
overwrite : bool
If True, allows overwriting the features of an existing sample ID. Default : False.
feature_names : list
The names for each feature. Assumed to be in the same order as features
ValueError
If sample_id is already in the MLDataset (and overwrite=False), or if the dimensionality of the current sample does not match the rest of the dataset, or if feature_names do not match the existing names.
TypeError
If sample to be added is of different data type compared to existing samples.
classmethod check_features(features)[source]

Method to ensure the data to be added is non-empty and vectorized.

features : iterable
Any data that can be converted to a numpy array.
features : numpy array
Flattened non-empty numpy array.
ValueError
If input data is empty.
class_set

Set of unique classes in the dataset.

class_sizes

Returns the sizes of the different classes in a Counter object.

classes

Identifiers (sample IDs, or sample names etc) forming the basis of dict-type MLDataset.

data

data in its original dict form.

data_and_labels()[source]

Dataset features and labels in a matrix form for learning.

Also returns sample_ids in the same order.

data_matrix : ndarray
2D array of shape [num_samples, num_features] with features corresponding row-wise to sample_ids
labels : ndarray
Array of numeric labels for each sample corresponding row-wise to sample_ids
sample_ids : list
List of sample ids
del_sample(sample_id)[source]

Method to remove a sample from the dataset.

sample_id : str
sample id to be removed.
UserWarning
If sample id to delete was not found in the dataset.
description

Text description (header) that can be set by user.

dtype

Data type of the features in each sample.

extend(other)[source]

Method to extend the dataset vertically (add samples from another dataset).

other : MLDataset
second dataset to be combined with the current (different samples, but same dimensionality)
TypeError
if input is not an MLDataset.
feature_names

Returns the feature names as a numpy array of strings.

get(item, not_found_value=None)[source]

Method like dict.get(), returning the specified value if the key is not found.

get_class(class_id)[source]

Returns a smaller dataset belonging to the requested class.

class_id : str
identifier of the class to be returned.
MLDataset
With subset of samples belonging to the given class.
ValueError
If one or more of the requested classes do not exist in this dataset, or if the specified id is empty or None.
get_feature_subset(subset_idx)[source]

Returns the subset of features indexed numerically.

subset_idx : list, ndarray
List of indices to features to be returned
MLDataset : MLDataset
With the subset of features requested.
UnboundLocalError
If input indices are out of bounds for the dataset.
get_subset(subset_ids)[source]

Returns a smaller dataset identified by their keys/sample IDs.

subset_ids : list
List of sample IDs to be extracted from the dataset.
sub-dataset : MLDataset
sub-dataset containing only requested sample IDs.
glance(nitems=5)[source]

Quick and partial glance of the data matrix.

nitems : int
Number of items to glance from the dataset. Default : 5

dict

keys

Sample identifiers (strings) forming the basis of MLDataset (same as sample_ids)

static keys_with_value(dictionary, value)[source]

Returns a subset of keys from the dict with the value supplied.

label_set

Set of labels in the dataset corresponding to class_set.

labels

Returns the array of labels for all the samples.

num_classes

Total number of classes in the dataset.

num_features

number of features in each sample.

num_samples

number of samples in the entire dataset.

random_subset(perc_in_class=0.5)[source]

Returns a random sub-dataset (of specified size by percentage) within each class.

perc_in_class : float
Fraction of samples to be taken from each class.
subdataset : MLDataset
random sub-dataset of specified size.
random_subset_ids(perc_per_class=0.5)[source]

Returns a random subset of sample ids (of specified size by percentage) within each class.

perc_per_class : float
Fraction of samples per class
subset : list
Combined list of sample ids from all classes.
ValueError
If no subjects from one or more classes were selected.
UserWarning
If an empty or full dataset is requested.
random_subset_ids_by_count(count_per_class=1)[source]

Returns a random subset of sample ids (of specified size by count) within each class.

count_per_class : int
Exact number of samples per each class.
subset : list
Combined list of sample ids from all classes.
sample_ids

Sample identifiers (strings) forming the basis of MLDataset (same as keys).

sample_ids_in_class(class_id)[source]

Returns a list of sample ids belonging to a given class.

class_id : str
class id to query.
subset_ids : list
List of sample ids belonging to a given class.
save(file_path)[source]

Method to save the dataset to disk.

file_path : str
File path to save the current dataset to
IOError
If saving to disk is not successful.
summarize_classes()[source]

Summary of classes: names, numeric labels and sizes

tuple : class_set, label_set, class_sizes

class_set : list
List of names of all the classes
label_set : list
Label for each class in class_set
class_sizes : list
Size of each class (number of samples)
train_test_split_ids(train_perc=None, count_per_class=None)[source]

Returns two disjoint sets of sample ids for use in cross-validation.

Offers two ways to specify the sizes: fraction or count. Only one of the two can be used at a time.

train_perc : float
fraction of samples from each class to build the training subset.
count_per_class : int
exact count of samples from each class to build the training subset.
train_set : list
List of ids in the training set.
test_set : list
List of ids in the test set.
ValueError
If the fraction is outside the open interval (0, 1), or if the count is larger than the size of the smallest class, or if an unrecognized format is provided for the input args, or if the selection results in empty subsets for either the train or the test set.
transform(func, func_description=None)[source]

Applies a given function to the features of each subject and returns a new dataset with other info unchanged.
func : callable

A valid callable that takes in a single ndarray and returns a single ndarray. Ensure the transformed dimensionality is the same for all subjects.

If your function requires more than one argument, use functools.partial to freeze all the arguments except the features for the subject.

func_description : str, optional
Human readable description of the given function.
xfm_ds : MLDataset
With features obtained from the subject-wise transform.
TypeError
If given func is not a callable
ValueError
If transformation of any of the subjects' features raises an exception.

A simple example:

from pyradigm import MLDataset
import numpy as np

# get_pcg is a user-defined function applying an ROI mask for the PCG
thickness = MLDataset(filepath='ADNI_thickness.csv')
pcg_thickness = thickness.transform(func=get_pcg, func_description='applying ROI mask for PCG')
pcg_median = pcg_thickness.transform(func=np.median, func_description='median per subject')

A more complex example, with a function taking more than one argument:

from pyradigm import MLDataset
from functools import partial
import numpy as np
import hiwenet

# read_roi_membership is a user-defined helper returning ROI labels per feature
thickness = MLDataset(filepath='ADNI_thickness.csv')
roi_membership = read_roi_membership()
hw = partial(hiwenet.extract, groups=roi_membership)

thickness_hiwenet = thickness.transform(func=hw, func_description='histogram weighted networks')
median_thk_hiwenet = thickness_hiwenet.transform(func=np.median, func_description='median per subject')
pyradigm.cli_run()[source]

Command line interface

This interface makes it possible:

  • to display basic info about datasets without having to code
  • to perform basic arithmetic (add multiple classes or feature sets)