# Model - Workflow (`model.pipeline`)

## DataSetX (DataSet)
The `DataSetX` class manages the dataset to be used in a model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `dict` | A dictionary that contains the configurations. | *required* |
| `base_folder` | `str` | Working directory where all the artifacts are preserved. | *required* |
### _export_train_test_data(self)

*private*

`_export_train_test_data` saves the train and test datasets.
Source code in `haferml/model/pipeline.py`:

```python
def _export_train_test_data(self):
    """
    _export_train_test_data saves train and test datasets
    """
    dataset_folder = self.artifacts["dataset"]["local_absolute"]

    ## Save the train and test datasets
    logger.info("Export test and train data")
    self._save_data(
        self.X_train, os.path.join(dataset_folder, "model_X_train.parquet")
    )
    self._save_data(
        self.X_test, os.path.join(dataset_folder, "model_X_test.parquet")
    )
    self._save_data(
        pd.DataFrame(self.y_train, columns=self.targets),
        os.path.join(dataset_folder, "model_y_train.parquet"),
    )
    self._save_data(
        pd.DataFrame(self.y_test, columns=self.targets),
        os.path.join(dataset_folder, "model_y_test.parquet"),
    )

    # Save dataset locally
    self._save_data(self.data, os.path.join(dataset_folder, "dataset.parquet"))
```
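The destination folder above comes from `self.artifacts["dataset"]["local_absolute"]`. Once exported, the splits can be read back with pandas; a minimal sketch, using an illustrative folder path (`artifacts/dataset` is not part of haferml; the real path depends on your configuration):

```python
import os

import pandas as pd

# Illustrative path; in practice use artifacts["dataset"]["local_absolute"]
dataset_folder = "artifacts/dataset"

# Read the exported splits back into DataFrames
X_train = pd.read_parquet(os.path.join(dataset_folder, "model_X_train.parquet"))
y_train = pd.read_parquet(os.path.join(dataset_folder, "model_y_train.parquet"))
```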
### _save_data(dataframe, destination)

*private, staticmethod*

`_save_data` saves the dataframe locally.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataframe` | `pandas.DataFrame` | Dataframe to be saved. | *required* |
| `destination` | `str` | Where the data is saved. | *required* |
Source code in `haferml/model/pipeline.py`:

```python
@staticmethod
def _save_data(dataframe, destination):
    """
    `_save_data` saves the dataframe locally.

    :param dataframe: dataframe to be saved
    :type dataframe: pandas.DataFrame
    :param destination: where the data is saved
    :type destination: str
    """
    logger.info("Export test and train data")
    dataframe.to_parquet(destination)
```
### create_train_test_datasets(self, dataframe)

`create_train_test_datasets` will create:

- `self.data`: the full input data right before the train/test split
- `self.X_train`, `self.y_train`, `self.X_test`, `self.y_test`
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataframe` | `pandas.DataFrame` | The dataframe to be split. | *required* |
Source code in `haferml/model/pipeline.py`:

```python
def create_train_test_datasets(self, dataframe):
    """
    create_train_test_datasets will create

    - self.data: the full input data right before train test split
    - self.X_train, self.y_train, self.X_test, self.y_test

    :param dataframe: the dataframe to be split.
    :type dataframe: pandas.DataFrame
    """
    raise Exception("create_train_test_dataset has not yet been implemented!")
```
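The base implementation only raises, so a concrete subclass has to perform the split itself and set the attributes listed above. A minimal sketch, assuming scikit-learn is available; the subclass name and the column names (`feature_a`, `feature_b`, `target`) are illustrative, not part of haferml:

```python
from haferml.model.pipeline import DataSetX
from sklearn.model_selection import train_test_split


class MyDataSet(DataSetX):
    """Hypothetical subclass; the feature/target columns are illustrative."""

    def create_train_test_datasets(self, dataframe):
        # Keep the full input data around, as described by the base class
        self.data = dataframe
        # self.targets is later used by _export_train_test_data as the y columns
        self.targets = ["target"]
        features = ["feature_a", "feature_b"]

        # Plain 80/20 split; the real split strategy is up to the project
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            dataframe[features],
            dataframe[self.targets],
            test_size=0.2,
            random_state=42,
        )
```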
## ModelSetX (ModelSet)

The core of the model, including the hyperparameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `dict` | A dictionary that contains the configurations. | *required* |
| `base_folder` | `str` | Working directory where all the artifacts are preserved. | *required* |
### hyperparameters

*property, readonly*

`hyperparameters` specifies the hyperparameters. This is a read-only property.
### _set_hyperparameters(self)

*private*

`_set_hyperparameters` creates the hyperparameter grid.
Source code in `haferml/model/pipeline.py`:

```python
def _set_hyperparameters(self):
    """
    _set_hyperparameters creates the hyperparameter grid
    """
    ...
```
### create_model(self)

`create_model` creates the model and updates the property `self.model`.

Source code in `haferml/model/pipeline.py`:

```python
def create_model(self):
    ...
```
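Both hooks are left empty in the base class. A minimal sketch of a concrete subclass, assuming scikit-learn; the estimator, the grid, and the `_hyperparameters` attribute are illustrative assumptions rather than part of haferml. Since the wiring between `_set_hyperparameters` and the `hyperparameters` property is not shown here, the sketch calls the hook explicitly and reads its own attribute. A search object such as `GridSearchCV` fits naturally because `ModelWorkflowX.fit_and_report` reads `best_params_` and `cv_results_` from the fitted model:

```python
from haferml.model.pipeline import ModelSetX
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


class MyModelSet(ModelSetX):
    """Hypothetical subclass; estimator and grid are illustrative."""

    def _set_hyperparameters(self):
        # Assumed backing attribute; check the base class for the actual
        # attribute behind the `hyperparameters` property.
        self._hyperparameters = {
            "n_estimators": [100, 300],
            "max_depth": [None, 10],
        }

    def create_model(self):
        # Build the grid, then wrap the estimator in a search object so that
        # best_params_ and cv_results_ are available after fitting.
        self._set_hyperparameters()
        self.model = GridSearchCV(
            RandomForestRegressor(),
            param_grid=self._hyperparameters,
            cv=3,
        )
```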
## ModelWorkflowX (ModelWorkflow)

The `ModelWorkflowX` class holds a `DataSetX` and a `ModelSetX`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `dict` | A dictionary that contains the configs. | *required* |
| `dataset` | `haferml.model.DataSet` | A `DataSet` object that contains the data and provides a … | *required* |
| `modelset` | `haferml.model.ModelSet` | A `ModelSet` object that contains the model as well as the hyperparameters and a … | *required* |
| `base_folder` | `str` | Working directory where all the artifacts are preserved. | *required* |
### export_results(self)

`export_results` saves the necessary artifacts.
Source code in `haferml/model/pipeline.py`:

```python
def export_results(self):
    """
    export_results saves the necessary artifacts
    """
    model_artifacts = self.artifacts["model"]
    model_folder = model_artifacts["local_absolute"]
    model_path = model_artifacts["name_absolute"]

    if not os.path.exists(model_folder):
        os.makedirs(model_folder)

    logger.info("Preserving models ...")
    joblib.dump(self.ModelSet.model, model_path)

    logger.info("Preserving logs ...")
    log_file_path = f"{model_path}.log"
    logger.info(f"Save log file to {log_file_path}")
    with open(log_file_path, "a+") as fp:
        json.dump(self.report, fp, default=isoencode)
        fp.write("\n")
    logger.info("Saved logs")
```
### fit_and_report(self)

`fit_and_report` fits the model using the input data and generates reports.
Source code in `haferml/model/pipeline.py`:

```python
def fit_and_report(self):
    """
    fit_and_report fits the model using input data and generates reports
    """
    logger.info("Fitting the model ...")
    logger.debug(
        "Shape of train data:\n"
        f"X_train: {self.DataSet.X_train.shape}, {self.DataSet.X_train.sample(3)}\n"
        f"y_train: {self.DataSet.y_train.shape}, {self.DataSet.y_train.sample(3)}"
    )

    self.ModelSet.model.fit(
        self.DataSet.X_train.squeeze(), self.DataSet.y_train.squeeze()
    )

    self.report = {
        "hyperparameters": self.ModelSet.hyperparameters,
        "best_params": self.ModelSet.model.best_params_,
        "cv_results": self.ModelSet.model.cv_results_,
    }

    logger.debug(self.report)
```
### train(self, dataset)

`train` chains the steps of the training workflow.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `pandas.DataFrame` | Dataframe used to train the model. | *required* |
Source code in `haferml/model/pipeline.py`:

```python
def train(self, dataset):
    """
    train connects the training workflow

    :param dataset: dataframe being used to train the model
    :type dataset: pandas.DataFrame
    """
    logger.info("1. Create train test datasets")
    self.DataSet.create_train_test_datasets(dataset)

    logger.info("2. Create model")
    self.ModelSet.create_model()

    logger.info("3. Fit model and report")
    self.fit_and_report()

    logger.info("4. Export results")
    self.export_results()
```
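Putting the pieces together, a training run wires a concrete `DataSet`, a concrete `ModelSet`, and the workflow, then calls `train`. A minimal sketch, reusing the hypothetical `MyDataSet` and `MyModelSet` subclasses sketched above; the config contents and the input path are illustrative, and in practice the config must provide the artifact locations (`artifacts["dataset"]`, `artifacts["model"]`) used by the export steps:

```python
import pandas as pd

from haferml.model.pipeline import ModelWorkflowX

# Placeholder config; real projects load this from a config file, and it must
# describe the artifact folders used by the dataset and model export steps.
config = {}
base_folder = "."

dataset = MyDataSet(config=config, base_folder=base_folder)
modelset = MyModelSet(config=config, base_folder=base_folder)

workflow = ModelWorkflowX(
    config=config,
    dataset=dataset,
    modelset=modelset,
    base_folder=base_folder,
)

# Runs the four steps: split, create model, fit and report, export artifacts
workflow.train(pd.read_parquet("training_data.parquet"))
```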