model.pipeline

Model - Workflow¤

DataSetX (DataSet) ¤

The DataSetX class handles the dataset to be used in a model.

Parameters:

Name         Type  Description                                                Default
config       dict  a dictionary that contains the configurations.            required
base_folder  str   working directory where all the artifacts are preserved.  required

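The constructor takes the configuration dictionary and the working directory. A minimal usage sketch, where the config keys are purely illustrative and MyDataSet is a hypothetical subclass (sketched under create_train_test_datasets below):

# Illustrative only: the real config schema depends on your project.
config = {
    "artifacts": {
        "dataset": {"local": "artifacts/dataset"},
    },
}

dataset = MyDataSet(config=config, base_folder="/path/to/workdir")
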
_export_train_test_data(self) private ¤

_export_train_test_data saves train and test datasets

Source code in haferml/model/pipeline.py
def _export_train_test_data(self):
    """
    _export_train_test_data saves train and test datasets
    """

    dataset_folder = self.artifacts["dataset"]["local_absolute"]

    ## Save the train and test datasets
    logger.info("Export test and train data")
    self._save_data(
        self.X_train, os.path.join(dataset_folder, "model_X_train.parquet")
    )

    self._save_data(
        self.X_test, os.path.join(dataset_folder, "model_X_test.parquet")
    )

    self._save_data(
        pd.DataFrame(self.y_train, columns=self.targets),
        os.path.join(dataset_folder, "model_y_train.parquet"),
    )

    self._save_data(
        pd.DataFrame(self.y_test, columns=self.targets),
        os.path.join(dataset_folder, "model_y_test.parquet"),
    )

    # Save dataset locally
    self._save_data(self.data, os.path.join(dataset_folder, "dataset.parquet"))

_save_data(dataframe, destination) private staticmethod ¤

_save_data saves the dataframe locally.

Parameters:

Name         Type              Description              Default
dataframe    pandas.DataFrame  dataframe to be saved    required
destination  str               where the data is saved  required
Source code in haferml/model/pipeline.py
@staticmethod
def _save_data(dataframe, destination):
    """
    `_save_data` saves the dataframe locally.

    :param dataframe: dataframe to be saved
    :type dataframe: pandas.DataFrame
    :param destination: where the data is saved
    :type destination: str
    """
    logger.info(f"Saving dataframe to {destination}")
    dataframe.to_parquet(destination)
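
Note that pandas.DataFrame.to_parquet requires a parquet engine such as pyarrow or fastparquet to be installed. A small usage sketch:

import pandas as pd

from haferml.model.pipeline import DataSetX

df = pd.DataFrame({"price": [1.0, 2.5], "quantity": [3, 4]})
# _save_data is a staticmethod, so it can be called on the class itself
DataSetX._save_data(df, "/tmp/example.parquet")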

create_train_test_datasets(self, dataframe) ¤

create_train_test_datasets will create

  • self.data: the full input data right before train test split
  • self.X_train, self.y_train, self.X_test, self.y_test

Parameters:

Name       Type              Description                Default
dataframe  pandas.DataFrame  the dataframe to be split  required
Source code in haferml/model/pipeline.py
def create_train_test_datasets(self, dataframe):
    """
    create_train_test_datasets will create

    - self.data: the full input data right before train test split
    - self.X_train, self.y_train, self.X_test, self.y_test

    :param dataframe: the dataframe to be split.
    :type dataframe: pandas.DataFrame
    """

    raise NotImplementedError("create_train_test_datasets has not yet been implemented!")
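
The base implementation only raises, so subclasses are expected to override this method. A minimal sketch of such an override, assuming scikit-learn's train_test_split and hypothetical feature and target column names:

from sklearn.model_selection import train_test_split

from haferml.model.pipeline import DataSetX


class MyDataSet(DataSetX):
    # Hypothetical column names; adapt them to your own dataframe.
    features = ["feature_1", "feature_2"]
    targets = ["target"]

    def create_train_test_datasets(self, dataframe):
        # Keep the full input data around, as the docstring requires
        self.data = dataframe
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            dataframe[self.features],
            dataframe[self.targets],
            test_size=0.2,
            random_state=42,
        )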

ModelSetX (ModelSet) ¤

The core of the model including hyperparameters.

Parameters:

Name         Type  Description                                                Default
config       dict  a dictionary that contains the configurations.            required
base_folder  str   working directory where all the artifacts are preserved.  required

hyperparameters property readonly ¤

hyperparameters holds the hyperparameters of the model. This is a read-only property.

_set_hyperparameters(self) private ¤

_set_hyperparameters creates the hyperparameter grid

Source code in haferml/model/pipeline.py
def _set_hyperparameters(self):
    """
    _set_hyperparameters creates the hyperparameter grid
    """
    ...

create_model(self) ¤

create_model creates the model and updates the property self.model.

Source code in haferml/model/pipeline.py
def create_model(self):
    ...
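
Since fit_and_report (below) reads best_params_ and cv_results_ from the fitted model, create_model is expected to produce a cross-validated search object rather than a bare estimator. A sketch, assuming scikit-learn's GridSearchCV with a random forest; note that backing the read-only hyperparameters property with a _hyperparameters attribute is an assumption about the base class, not documented behavior:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

from haferml.model.pipeline import ModelSetX


class MyModelSet(ModelSetX):
    def _set_hyperparameters(self):
        # Illustrative grid; assumes the `hyperparameters` property reads
        # from `_hyperparameters` (check the base class source).
        self._hyperparameters = {
            "n_estimators": [100, 300],
            "max_depth": [None, 10],
        }

    def create_model(self):
        self._set_hyperparameters()
        self.model = GridSearchCV(
            RandomForestRegressor(),
            param_grid=self.hyperparameters,
            cv=5,
        )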

ModelWorkflowX (ModelWorkflow) ¤

The ModelWorkflowX class holds a DataSetX and a ModelSetX and wires them into a training workflow.

Parameters:

Name         Type                    Description                                                                                  Default
config       dict                    a dictionary that contains the configurations.                                              required
dataset      haferml.model.DataSet   a DataSet object that contains the data and provides a create_train_test_datasets method.  required
modelset     haferml.model.ModelSet  a ModelSet object that contains the model, the hyperparameters, and a create_model method. required
base_folder  str                     working directory where all the artifacts are preserved.                                    required

export_results(self) ¤

export_results saves the necessary artifacts

Source code in haferml/model/pipeline.py
def export_results(self):
    """
    export_results saves the necessary artifacts
    """
    model_artifacts = self.artifacts["model"]
    model_folder = model_artifacts["local_absolute"]
    model_path = model_artifacts["name_absolute"]

    if not os.path.exists(model_folder):
        os.makedirs(model_folder)

    logger.info("Preserving models ...")
    joblib.dump(self.ModelSet.model, model_path)

    logger.info("Perserving logs ...")
    log_file_path = f"{model_path}.log"
    logger.info(f"Save log file to {log_file_path}")
    with open(log_file_path, "a+") as fp:
        json.dump(self.report, fp, default=isoencode)
        fp.write("\n")
    logger.info(f"Saved logs")

fit_and_report(self) ¤

fit_and_report fits the model using the input data and generates a report

Source code in haferml/model/pipeline.py
def fit_and_report(self):
    """
    fit_and_report fits the model using the input data and generates a report
    """

    logger.info("Fitting the model ...")
    logger.debug(
        "Shape of train data:\n"
        f"X_train: {self.DataSet.X_train.shape}, {self.DataSet.X_train.sample(3)}\n"
        f"y_train: {self.DataSet.y_train.shape}, {self.DataSet.y_train.sample(3)}"
    )
    self.ModelSet.model.fit(
        self.DataSet.X_train.squeeze(), self.DataSet.y_train.squeeze()
    )

    self.report = {
        "hyperparameters": self.ModelSet.hyperparameters,
        "best_params": self.ModelSet.model.best_params_,
        "cv_results": self.ModelSet.model.cv_results_,
    }

    logger.debug(self.report)

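Because best_params_ and cv_results_ are read from the fitted model, a bare estimator without hyperparameter search would raise an AttributeError here. The resulting report has roughly this shape (values illustrative):

# Illustrative values only
report = {
    "hyperparameters": {"n_estimators": [100, 300]},        # the search grid
    "best_params": {"max_depth": 10, "n_estimators": 300},  # model.best_params_
    "cv_results": {"mean_test_score": [0.81, 0.84]},        # model.cv_results_ (truncated)
}
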
train(self, dataset) ¤

train chains together the steps of the training workflow

Parameters:

Name     Type              Description                              Default
dataset  pandas.DataFrame  dataframe being used to train the model  required
Source code in haferml/model/pipeline.py
def train(self, dataset):
    """
    train chains together the steps of the training workflow

    :param dataset: dataframe being used to train the model
    :type dataset: pandas.DataFrame
    """

    logger.info("1. Create train test datasets")
    self.DataSet.create_train_test_datasets(dataset)
    logger.info("2. Create model")
    self.ModelSet.create_model()
    logger.info("3. Fit model and report")
    self.fit_and_report()
    logger.info("4. Export results")
    self.export_results()
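
Putting the pieces together, a hypothetical end-to-end run using the MyDataSet and MyModelSet sketches from above:

import pandas as pd

from haferml.model.pipeline import ModelWorkflowX

config = {}  # project-specific configuration (schema not shown here)
base_folder = "/path/to/workdir"

dataset = MyDataSet(config=config, base_folder=base_folder)
modelset = MyModelSet(config=config, base_folder=base_folder)
workflow = ModelWorkflowX(
    config=config,
    dataset=dataset,
    modelset=modelset,
    base_folder=base_folder,
)

df = pd.read_parquet("training_data.parquet")
workflow.train(df)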