qsarify.feature_selection_single module

Single-Threaded Feature Selection Module

This module contains the single-threaded version of the feature selection algorithm. The algorithm is a genetic algorithm that scores each candidate set of features with a regression model (linear regression in mlr_selection, random forest in rf_selection) and uses the output of clustering to ensure that the selected features are not redundant.
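One plausible reading of "using the output of clustering" is that a candidate never holds two descriptors from the same cluster. The sketch below illustrates that idea only; it is not taken from the qsarify source. It assumes that `cluster_info` maps each descriptor name to a cluster label, and the helper names `random_candidate` and `mutate` are hypothetical.

```python
import random


def random_candidate(cluster_info, component, rng):
    """Pick `component` descriptors, each from a different cluster.

    Assumption: cluster_info is {descriptor_name: cluster_label}.
    """
    # Group descriptor names by their cluster label.
    by_cluster = {}
    for name, label in cluster_info.items():
        by_cluster.setdefault(label, []).append(name)
    if component > len(by_cluster):
        raise ValueError("component exceeds the number of clusters")
    # Choose distinct clusters first, then one descriptor from each.
    labels = rng.sample(sorted(by_cluster), component)
    return sorted(rng.choice(by_cluster[label]) for label in labels)


def mutate(candidate, cluster_info, rng):
    """Replace one descriptor while keeping all clusters distinct."""
    child = list(candidate)
    idx = rng.randrange(len(child))
    # Clusters occupied by the descriptors that stay in the candidate.
    used = {cluster_info[name] for i, name in enumerate(child) if i != idx}
    # Valid replacements: any other descriptor outside those clusters.
    options = [
        name
        for name, label in cluster_info.items()
        if label not in used and name != child[idx]
    ]
    if options:
        child[idx] = rng.choice(options)
    return sorted(child)
```

With these helpers, every candidate the search visits draws its descriptors from distinct clusters, so two strongly related descriptors grouped into the same cluster cannot be selected together.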

qsarify.feature_selection_single.mlr_selection(X_data, y_data, cluster_info, component, model='regression', learning=50000, bank=200, interval=1000)

Performs feature selection using a linear regression model and a genetic algorithm on a single thread. This is the vanilla version of the algorithm, which is not parallelized.

Parameters:

X_data (DataFrame) – descriptor data
y_data (DataFrame) – target data
cluster_info (dict) – descriptor cluster information
component (int) – number of features to select
model (str) – learning algorithm to use, default = "regression"
learning (int) – number of iterations to perform, default = 50000
bank (int) – number of models to keep in the bank, default = 200
interval (int) – number of iterations to perform before printing the current time, default = 1000

Returns:

best_model (list) – best model found
best_score (float) – best score found
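How the learning, bank, and interval parameters could interact is sketched below. This is a generic bank-based search loop written under stated assumptions, not the library's implementation: the function name `bank_search` and the callables `score`, `propose`, and `mutate` are hypothetical stand-ins for model fitting and candidate generation, and higher scores are assumed to be better.

```python
import random
import time


def bank_search(score, propose, mutate, learning=50000, bank=200,
                interval=1000, seed=0):
    """Iterate `learning` times, keeping the `bank` best candidates seen.

    score(candidate) -> float, propose(rng) -> candidate and
    mutate(candidate, rng) -> candidate are caller-supplied (hypothetical).
    """
    rng = random.Random(seed)
    # Fill the bank with random candidates stored as (score, candidate).
    pool = []
    for _ in range(bank):
        candidate = propose(rng)
        pool.append((score(candidate), candidate))
    for step in range(1, learning + 1):
        # Mutate a randomly chosen bank member and score the child.
        _, parent = rng.choice(pool)
        child = mutate(parent, rng)
        pool.append((score(child), child))
        # Keep only the `bank` highest-scoring candidates.
        pool.sort(key=lambda pair: pair[0], reverse=True)
        del pool[bank:]
        # Report progress every `interval` iterations.
        if step % interval == 0:
            print(step, time.strftime("%H:%M:%S"))
    best_score, best_model = pool[0]
    return best_model, best_score
```

The return value mirrors the documented pair: the best candidate first, then its score. In this sketch a larger bank keeps more diverse parents available at the cost of a larger sort on every iteration.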

qsarify.feature_selection_single.rf_selection(X_data, y_data, cluster_info, component, model='regression', learning=50000, bank=200, interval=1000)

Performs feature selection using a random forest model and a genetic algorithm on a single thread. This is the vanilla version of the algorithm, which is not parallelized.

Parameters:

X_data (DataFrame) – descriptor data
y_data (DataFrame) – target data
cluster_info (dict) – descriptor cluster information
component (int) – number of features to select
model (str) – learning algorithm to use, default = "regression"
learning (int) – number of iterations to perform, default = 50000
bank (int) – number of models to keep in the bank, default = 200
interval (int) – number of iterations to perform before printing the current time, default = 1000

Returns:

best_model (list) – best model found
best_score (float) – best score found