API Reference
A tutorial-like presentation is available at Usage examples, using the following API.
class pyradigm.MLDataset(filepath=None, in_dataset=None, arff_path=None, data=None, labels=None, classes=None, description='', feature_names=None, encode_nonnumeric=False)

Bases: object

An ML dataset to ease workflow and maintain integrity.
add_classes(classes)

Helper to rename the classes, given a dict keyed by the original sample IDs.

Parameters:
- classes : dict
  Dict of class names keyed by sample IDs.

Raises:
- TypeError
  If classes is not a dict.
- ValueError
  If not all samples in the dataset are present in the input dict, or if one of the samples in the input is not recognized.
add_sample(sample_id, features, label, class_id=None, overwrite=False, feature_names=None)

Adds a new sample to the dataset with its features, label and class ID. This is the preferred way to construct the dataset.

Parameters:
- sample_id : str, int
  The identifier that uniquely identifies this sample.
- features : list, ndarray
  The features for this sample.
- label : int, str
  The label for this sample.
- class_id : int, str
  The class for this sample. If not provided, the label converted to a string becomes its ID.
- overwrite : bool
  If True, allows overwriting the features of an existing sample ID. Default: False.
- feature_names : list
  The names for each feature. Assumed to be in the same order as features.

Raises:
- ValueError
  If sample_id is already in the MLDataset (and overwrite=False), if the dimensionality of the new sample does not match that of the existing samples, or if feature_names do not match the existing names.
- TypeError
  If the sample to be added is of a different data type compared to the existing samples.
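The integrity checks described above can be sketched in plain Python. This is an illustration of the documented behaviour, not pyradigm's actual implementation; the function and variable names are hypothetical.

```python
import numpy as np

def add_sample_sketch(data, labels, sample_id, features, label, overwrite=False):
    """Illustrative version of the checks add_sample performs (hypothetical helper)."""
    if sample_id in data and not overwrite:
        raise ValueError('{} exists; pass overwrite=True to replace it'.format(sample_id))
    features = np.atleast_1d(features).flatten()
    if data:
        # dimensionality must match that of the existing samples
        existing = next(iter(data.values()))
        if features.size != existing.size:
            raise ValueError('dimensionality mismatch: {} vs {}'.format(features.size, existing.size))
    data[sample_id] = features
    labels[sample_id] = label

data, labels = {}, {}
add_sample_sketch(data, labels, 's01', [1.0, 2.0, 3.0], 1)
add_sample_sketch(data, labels, 's02', [4.0, 5.0, 6.0], 2)
```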
classmethod check_features(features)

Method to ensure the data to be added is non-empty and vectorized.

Parameters:
- features : iterable
  Any data that can be converted to a numpy array.

Returns:
- features : numpy array
  Flattened non-empty numpy array.

Raises:
- ValueError
  If the input data is empty.
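A minimal sketch of such a validation step, assuming numpy (illustrative only, not the actual implementation):

```python
import numpy as np

def check_features_sketch(features):
    """Illustrative check: convert input to a flat, non-empty numpy array."""
    features = np.asarray(features)
    if features.size == 0:
        raise ValueError('input data is empty')
    return features.flatten()

vec = check_features_sketch([[1, 2], [3, 4]])  # 2D input becomes a flat vector
```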
class_set

Set of unique classes in the dataset.

class_sizes

Returns the sizes of the different classes as a Counter object.

classes

Identifiers (sample IDs, or sample names etc.) forming the basis of the dict-type MLDataset.

data

Data in its original dict form.
data_and_labels()

Dataset features and labels in matrix form for learning. Also returns sample_ids in the same order.

Returns:
- data_matrix : ndarray
  2D array of shape [num_samples, num_features] with features corresponding row-wise to sample_ids.
- labels : ndarray
  Array of numeric labels for each sample corresponding row-wise to sample_ids.
- sample_ids : list
  List of sample ids.
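The assembly this method performs can be sketched with plain numpy (the sample data below is made up for illustration; this is not pyradigm's internal code):

```python
import numpy as np

# data and labels keyed by sample id, as in an MLDataset (illustrative values)
data = {'s01': np.array([0.1, 0.2]), 's02': np.array([0.3, 0.4])}
labels = {'s01': 1, 's02': 2}

sample_ids = list(data.keys())
# rows of the matrix and entries of the label array share one ordering: sample_ids
data_matrix = np.vstack([data[sid] for sid in sample_ids])  # [num_samples, num_features]
label_arr = np.array([labels[sid] for sid in sample_ids])
```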
del_sample(sample_id)

Method to remove a sample from the dataset.

Parameters:
- sample_id : str
  Sample id to be removed.

Raises:
- UserWarning
  If the sample id to delete was not found in the dataset.

description

Text description (header) that can be set by the user.
dtype

Data type of the features in each sample.
extend(other)

Method to extend the dataset vertically (add samples from another dataset).

Parameters:
- other : MLDataset
  Second dataset to be combined with the current one (different samples, but same dimensionality).

Raises:
- TypeError
  If the input is not an MLDataset.

feature_names

Returns the feature names as a numpy array of strings.

get(item, not_found_value=None)

Method akin to dict.get(), which can return a specified value if the key is not found.
get_class(class_id)

Returns a smaller dataset containing only the samples belonging to the requested class.

Parameters:
- class_id : str
  Identifier of the class to be returned.

Returns:
- MLDataset
  With the subset of samples belonging to the given class.

Raises:
- ValueError
  If one or more of the requested classes do not exist in this dataset, or if the specified id is empty or None.
get_feature_subset(subset_idx)

Returns the subset of features indexed numerically.

Parameters:
- subset_idx : list, ndarray
  List of indices of the features to be returned.

Returns:
- MLDataset : MLDataset
  With the subset of features requested.

Raises:
- UnboundLocalError
  If the input indices are out of bounds for the dataset.
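Conceptually, keeping a numeric subset of features amounts to column indexing of the data matrix, which can be sketched with numpy (illustrative only):

```python
import numpy as np

# 3 samples with 4 features each (made-up values)
features = np.arange(12).reshape(3, 4)

subset_idx = [0, 2]                 # keep the first and third feature
subset = features[:, subset_idx]    # column indexing yields shape (3, 2)
```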
get_subset(subset_ids)

Returns a smaller dataset identified by the given keys/sample IDs.

Parameters:
- subset_ids : list
  List of sample IDs to be extracted from the dataset.

Returns:
- sub-dataset : MLDataset
  Sub-dataset containing only the requested sample IDs.

glance(nitems=5)

Quick and partial glance of the data matrix.

Parameters:
- nitems : int
  Number of items to glance from the dataset. Default: 5.

Returns:
- dict
keys

Sample identifiers (strings) forming the basis of the MLDataset (same as sample_ids).

static keys_with_value(dictionary, value)

Returns the subset of keys from the dict that have the value supplied.
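A pure-Python equivalent of this helper might look like the following (illustrative sketch, not the library's code):

```python
def keys_with_value_sketch(dictionary, value):
    """Return the keys whose value matches the one supplied (hypothetical helper)."""
    return [key for key, val in dictionary.items() if val == value]

# e.g. finding all sample ids assigned to one class
classes = {'s01': 'ctrl', 's02': 'dis', 's03': 'ctrl'}
ctrl_ids = keys_with_value_sketch(classes, 'ctrl')
```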
label_set

Set of labels in the dataset corresponding to class_set.

labels

Returns the array of labels for all the samples.

num_classes

Total number of classes in the dataset.

num_features

Number of features in each sample.

num_samples

Number of samples in the entire dataset.
random_subset(perc_in_class=0.5)

Returns a random sub-dataset of the specified size (by percentage) within each class.

Parameters:
- perc_in_class : float
  Fraction of samples to be taken from each class.

Returns:
- subdataset : MLDataset
  Random sub-dataset of the specified size.

random_subset_ids(perc_per_class=0.5)

Returns a random subset of sample ids of the specified size (by percentage) within each class.

Parameters:
- perc_per_class : float
  Fraction of samples per class.

Returns:
- subset : list
  Combined list of sample ids from all classes.

Raises:
- ValueError
  If no subjects from one or more classes were selected.
- UserWarning
  If an empty or full dataset is requested.
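The per-class (stratified) sampling described above can be sketched in plain Python. This is an illustration of the semantics under made-up data, not pyradigm's implementation:

```python
import random
from collections import defaultdict

def random_subset_ids_sketch(classes, perc_per_class=0.5, seed=0):
    """Illustrative per-class sampling; classes maps sample id -> class id."""
    by_class = defaultdict(list)
    for sid, cid in classes.items():
        by_class[cid].append(sid)
    rng = random.Random(seed)
    subset = []
    for ids in by_class.values():
        # take the requested fraction from each class, at least one sample
        count = max(1, int(round(perc_per_class * len(ids))))
        subset.extend(rng.sample(ids, count))
    return subset

classes = {'s1': 'a', 's2': 'a', 's3': 'b', 's4': 'b'}
subset = random_subset_ids_sketch(classes, perc_per_class=0.5)
```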
random_subset_ids_by_count(count_per_class=1)

Returns a random subset of sample ids of the specified size (by count) within each class.

Parameters:
- count_per_class : int
  Exact number of samples per class.

Returns:
- subset : list
  Combined list of sample ids from all classes.
sample_ids

Sample identifiers (strings) forming the basis of the MLDataset (same as keys).

sample_ids_in_class(class_id)

Returns a list of sample ids belonging to a given class.

Parameters:
- class_id : str
  Class id to query.

Returns:
- subset_ids : list
  List of sample ids belonging to the given class.
save(file_path)

Method to save the dataset to disk.

Parameters:
- file_path : str
  File path to save the current dataset to.

Raises:
- IOError
  If saving to disk is not successful.
summarize_classes()

Summary of classes: names, numeric labels and sizes.

Returns:
tuple : class_set, label_set, class_sizes
- class_set : list
  List of names of all the classes.
- label_set : list
  Label for each class in class_set.
- class_sizes : list
  Size of each class (number of samples).
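The summary above can be computed from the class assignments with a Counter, sketched here on made-up data (illustrative, not the library's code):

```python
from collections import Counter

# classes maps sample id -> class name; labels maps sample id -> numeric label
classes = {'s01': 'ctrl', 's02': 'dis', 's03': 'ctrl'}
labels = {'s01': 0, 's02': 1, 's03': 0}

class_sizes = Counter(classes.values())          # size of each class
class_set = sorted(class_sizes.keys())           # names of all the classes
label_of = {cls: labels[sid] for sid, cls in classes.items()}
label_set = [label_of[cls] for cls in class_set]  # label for each class in class_set
```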
train_test_split_ids(train_perc=None, count_per_class=None)

Returns two disjoint sets of sample ids for use in cross-validation. Offers two ways to specify the sizes: fraction or count. Only one method can be used at a time.

Parameters:
- train_perc : float
  Fraction of samples from each class to build the training subset.
- count_per_class : int
  Exact count of samples from each class to build the training subset.

Returns:
- train_set : list
  List of ids in the training set.
- test_set : list
  List of ids in the test set.

Raises:
- ValueError
  If the fraction is outside the open interval (0, 1), if the count is larger than the size of the smallest class, if an unrecognized format is provided for the input args, or if the selection results in an empty subset for either the train or the test set.
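The fraction-based variant of this split can be sketched in plain Python, producing disjoint per-class train/test id lists (illustrative semantics on made-up data, not pyradigm's implementation):

```python
import random
from collections import defaultdict

def train_test_split_ids_sketch(classes, train_perc=0.5, seed=42):
    """Illustrative per-class split into disjoint train/test id lists."""
    if not 0 < train_perc < 1:
        raise ValueError('train_perc must lie in the open interval (0, 1)')
    by_class = defaultdict(list)
    for sid, cid in classes.items():
        by_class[cid].append(sid)
    rng = random.Random(seed)
    train, test = [], []
    for ids in by_class.values():
        rng.shuffle(ids)
        n_train = max(1, int(round(train_perc * len(ids))))
        train.extend(ids[:n_train])   # first part of the shuffle trains...
        test.extend(ids[n_train:])    # ...the remainder tests, so the sets are disjoint
    return train, test

classes = {'s1': 'a', 's2': 'a', 's3': 'b', 's4': 'b'}
train, test = train_test_split_ids_sketch(classes, train_perc=0.5)
```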
transform(func, func_description=None)

Applies a given function to the features of each subject and returns a new dataset with other info unchanged.

Parameters:
- func : callable
  A valid callable that takes in a single ndarray and returns a single ndarray. Ensure the transformed dimensionality is the same for all subjects. If your function requires more than one argument, use functools.partial to freeze all the arguments except the features for the subject.
- func_description : str, optional
  Human-readable description of the given function.

Returns:
- xfm_ds : MLDataset
  With features obtained from the subject-wise transform.

Raises:
- TypeError
  If the given func is not a callable.
- ValueError
  If the transformation of any subject's features raises an exception.
Simple example:

```python
import numpy as np
from pyradigm import MLDataset

thickness = MLDataset(filepath='ADNI_thickness.csv')
pcg_thickness = thickness.transform(func=get_pcg, func_description='applying ROI mask for PCG')
pcg_median = pcg_thickness.transform(func=np.median, func_description='median per subject')
```

Complex example with a function taking more than one argument:

```python
from functools import partial

import numpy as np
import hiwenet
from pyradigm import MLDataset

thickness = MLDataset(filepath='ADNI_thickness.csv')
roi_membership = read_roi_membership()
hw = partial(hiwenet.extract, groups=roi_membership)
thickness_hiwenet = thickness.transform(func=hw, func_description='histogram weighted networks')
median_thk_hiwenet = thickness_hiwenet.transform(func=np.median, func_description='median per subject')
```