mambular.data_utils

class mambular.data_utils.MambularDataset(*args: Any, **kwargs: Any)

Custom dataset for handling structured data with separate categorical and numerical features, tailored for both regression and classification tasks.

Parameters:
  • cat_features_list (list of Tensors) – Categorical feature tensors.

  • num_features_list (list of Tensors) – Numerical feature tensors.

  • embeddings_list (list of Tensors, optional) – Embedding tensors.

  • labels (Tensor, optional) – Target labels.

  • regression (bool, optional) – A flag indicating if the dataset is for a regression task. Defaults to True.
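To illustrate the idea of keeping categorical and numerical features in separate lists, here is a minimal pure-Python sketch. It is not the library's implementation: `SketchTabularDataset` is a hypothetical name, plain lists stand in for tensors, and both the returned item layout and the label casting are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


class SketchTabularDataset:
    """Hypothetical stand-in; plain lists play the role of tensors."""

    def __init__(
        self,
        cat_features_list: List[Sequence[int]],
        num_features_list: List[Sequence[float]],
        labels: Optional[Sequence[float]] = None,
        regression: bool = True,
    ) -> None:
        self.cat_features_list = cat_features_list
        self.num_features_list = num_features_list
        self.labels = labels
        self.regression = regression

    def __len__(self) -> int:
        # Every feature column holds one entry per sample.
        columns = list(self.cat_features_list) + list(self.num_features_list)
        return len(columns[0]) if columns else 0

    def __getitem__(self, idx: int):
        cat = [col[idx] for col in self.cat_features_list]
        num = [col[idx] for col in self.num_features_list]
        if self.labels is None:
            # Assumption: without labels, only the features are returned.
            return cat, num
        # Assumption: regression targets are floats, class labels are ints.
        label = self.labels[idx]
        label = float(label) if self.regression else int(label)
        return (cat, num), label


# One categorical column, two numerical columns, three samples.
ds = SketchTabularDataset(
    cat_features_list=[[0, 1, 2]],
    num_features_list=[[0.5, 1.5, 2.5], [1.0, 2.0, 3.0]],
    labels=[10, 20, 30],
    regression=True,
)
features, label = ds[1]
```

Each item pairs one sample's categorical values and numerical values with its label, which is the separation the class description refers to.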

class mambular.data_utils.MambularDataModule(*args: Any, **kwargs: Any)

A PyTorch Lightning data module for managing training and validation data loaders in a structured way.

This class simplifies the process of batch-wise data loading for training and validation datasets during the training loop, and is particularly useful when working with PyTorch Lightning’s training framework.

Parameters:
  • preprocessor (object) – An instance of your preprocessor class.

  • batch_size (int) – Size of batches for the DataLoader.

  • shuffle (bool) – Whether to shuffle the training data in the DataLoader.

  • X_val (DataFrame or None, optional) – Validation features. If None, a train-test split is used.

  • y_val (array-like or None, optional) – Validation labels. If None, a train-test split is used.

  • val_size (float, optional) – Proportion of data to include in the validation split if X_val and y_val are None.

  • random_state (int, optional) – Random seed for reproducibility in data splitting.

  • regression (bool, optional) – Whether the problem is regression (True) or classification (False).
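As a rough picture of what batch_size and shuffle control, the following pure-Python sketch groups samples into batches. The helper `iter_batches` and its `seed` argument are hypothetical and exist only for this illustration; the data module itself relies on PyTorch DataLoader objects, whose internals are not reproduced here.

```python
import random
from typing import Iterator, List, Optional, Sequence


def iter_batches(
    samples: Sequence,
    batch_size: int,
    shuffle: bool,
    seed: Optional[int] = None,
) -> Iterator[List]:
    """Yield consecutive batches, optionally visiting samples in random order."""
    order = list(range(len(samples)))
    if shuffle:
        # A seeded generator keeps the illustration repeatable.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


samples = list(range(10))
unshuffled = list(iter_batches(samples, batch_size=4, shuffle=False))
shuffled = list(iter_batches(samples, batch_size=4, shuffle=True, seed=0))
```

With ten samples and a batch size of four, both runs produce two full batches and one final batch of two; shuffling changes only which samples land in which batch.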

preprocess_data(X_train, y_train, X_val=None, y_val=None, embeddings_train=None, embeddings_val=None, val_size=0.2, random_state=101)

Preprocesses the training and validation data.

Parameters:
  • X_train (DataFrame or array-like, shape (n_samples_train, n_features)) – Training feature set.

  • y_train (array-like, shape (n_samples_train,)) – Training target values.

  • embeddings_train (array-like or list of array-like, optional) – Training embeddings if available.

  • X_val (DataFrame or array-like, shape (n_samples_val, n_features), optional) – Validation feature set. If None, a validation set will be created from X_train.

  • y_val (array-like, shape (n_samples_val,), optional) – Validation target values. If None, a validation set will be created from y_train.

  • embeddings_val (array-like or list of array-like, optional) – Validation embeddings if available.

  • val_size (float, optional) – Proportion of data to include in the validation split if X_val and y_val are None.

  • random_state (int, optional) – Random seed for reproducibility in data splitting.

Return type:

None
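The fallback described for X_val and y_val can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: `split_if_needed` is a hypothetical helper, and the source does not say how the library performs the split, so the shuffling and rounding used here are illustrative choices only.

```python
import random


def split_if_needed(X_train, y_train, X_val=None, y_val=None,
                    val_size=0.2, random_state=101):
    """Return (X_train, y_train, X_val, y_val).

    If a validation set is supplied it is passed through unchanged;
    otherwise a fraction `val_size` of the training rows is held out.
    """
    if X_val is not None and y_val is not None:
        return list(X_train), list(y_train), list(X_val), list(y_val)

    indices = list(range(len(X_train)))
    # Seeding with random_state makes the split repeatable.
    random.Random(random_state).shuffle(indices)
    n_val = max(1, round(len(indices) * val_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (
        [X_train[i] for i in train_idx],
        [y_train[i] for i in train_idx],
        [X_train[i] for i in val_idx],
        [y_train[i] for i in val_idx],
    )


X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]
X_tr, y_tr, X_va, y_va = split_if_needed(X, y, val_size=0.2, random_state=101)
```

Calling the helper twice with the same random_state yields the same partition, which is the reproducibility the parameter description refers to.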

setup(stage)

Transform the data and create DataLoaders.

test_dataloader()

Returns the test dataloader.

Returns:

DataLoader instance for the test dataset.

Return type:

DataLoader

train_dataloader()

Returns the training dataloader.

Returns:

DataLoader instance for the training dataset.

Return type:

DataLoader

val_dataloader()

Returns the validation dataloader.

Returns:

DataLoader instance for the validation dataset.

Return type:

DataLoader