qsarify.feature_selection_single module

Single-Threaded Feature Selection Module

This module contains the single-threaded version of the feature selection algorithm. It is a genetic algorithm that scores each candidate feature set with a fitted model (linear regression in mlr_selection, random forest in rf_selection) and uses the clustering output to keep the selected features from being redundant.
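To make the redundancy constraint concrete, here is a minimal sketch of drawing a candidate feature set with at most one descriptor per cluster. It assumes that cluster_info maps each descriptor name to a cluster label; the helper name sample_non_redundant and the toy descriptor names are hypothetical and are not part of qsarify.

```python
import random

def sample_non_redundant(cluster_info, component, rng=random):
    """Pick `component` descriptors, no two from the same cluster.

    Assumes cluster_info maps descriptor name -> cluster label.
    """
    # Group descriptor names by their cluster label.
    by_cluster = {}
    for name, label in cluster_info.items():
        by_cluster.setdefault(label, []).append(name)
    if component > len(by_cluster):
        raise ValueError("component exceeds the number of clusters")
    # Choose distinct clusters first, then one descriptor from each.
    chosen = rng.sample(sorted(by_cluster), component)
    return sorted(rng.choice(by_cluster[label]) for label in chosen)

# Hypothetical toy input: five descriptors in three clusters.
cluster_info = {"logP": 1, "MW": 1, "TPSA": 2, "HBD": 2, "nRing": 3}
features = sample_non_redundant(cluster_info, component=2, rng=random.Random(0))
```

Because every selected descriptor comes from a different cluster, two strongly correlated descriptors that were grouped together can never appear in the same candidate set.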

qsarify.feature_selection_single.mlr_selection(X_data, y_data, cluster_info, component, model='regression', learning=50000, bank=200, interval=1000)

Performs feature selection using a linear regression model and a genetic algorithm on a single thread. This is the vanilla, non-parallelized version of the algorithm.

Parameters:
  • X_data (DataFrame) – descriptor data

  • y_data (DataFrame) – target data

  • cluster_info (dict) – descriptor cluster information

  • component (int) – number of features to select

  • model (str) – learning algorithm to use (default: "regression")

  • learning (int) – number of iterations to perform (default: 50000)

  • bank (int) – number of models to keep in the bank (default: 200)

  • interval (int) – number of iterations to perform before printing the current time (default: 1000)

Returns:

  • best_model (list) – best model found

  • best_score (float) – best score found
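The roles of learning, bank and interval can be illustrated with a toy genetic search. This is a sketch under stated assumptions, not qsarify's implementation: the function ga_select, its mutation rule, and the additive toy_score are all hypothetical, and toy_score merely stands in for the score a fitted regression model would provide. cluster_info is again assumed to map descriptor names to cluster labels.

```python
import random
import time

def ga_select(score, cluster_info, component, learning=200, bank=10,
              interval=50, seed=0):
    """Toy genetic search; `score` maps a tuple of descriptor names to a float."""
    rng = random.Random(seed)
    by_cluster = {}
    for name, label in cluster_info.items():
        by_cluster.setdefault(label, []).append(name)
    labels = sorted(by_cluster)

    def rank(models):
        # Keep the `bank` best models; ties are broken by name.
        return sorted(set(models), key=lambda m: (-score(m), m))[:bank]

    def random_model():
        picked = rng.sample(labels, component)
        return tuple(sorted(rng.choice(by_cluster[c]) for c in picked))

    def mutate(model):
        # Replace one descriptor, drawing from a cluster the others do not use.
        model = list(model)
        i = rng.randrange(len(model))
        used = {cluster_info[f] for j, f in enumerate(model) if j != i}
        free = [c for c in labels if c not in used]
        model[i] = rng.choice(by_cluster[rng.choice(free)])
        return tuple(sorted(model))

    ranked = rank([random_model() for _ in range(bank)])
    for step in range(1, learning + 1):
        # Each iteration mutates every banked model and re-ranks the pool.
        ranked = rank(ranked + [mutate(m) for m in ranked])
        if step % interval == 0:
            print(f"iteration {step}: {time.strftime('%H:%M:%S')}")
    return list(ranked[0]), score(ranked[0])

# Hypothetical toy data and a made-up additive score (not a regression fit).
cluster_info = {"logP": 1, "MW": 1, "TPSA": 2, "HBD": 2, "nRing": 3, "nRot": 3}
weights = {"logP": 0.5, "MW": 0.1, "TPSA": 0.3, "HBD": 0.2, "nRing": 0.05, "nRot": 0.4}

def toy_score(model):
    return sum(weights[f] for f in model)

best_model, best_score = ga_select(toy_score, cluster_info, component=2)
```

In this sketch a larger bank keeps more candidate models alive between iterations at the cost of more scoring calls per iteration, while interval only controls how often progress is printed.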

qsarify.feature_selection_single.rf_selection(X_data, y_data, cluster_info, component, model='regression', learning=50000, bank=200, interval=1000)

Performs feature selection using a random forest model and a genetic algorithm on a single thread. This is the vanilla, non-parallelized version of the algorithm.

Parameters:
  • X_data (DataFrame) – descriptor data

  • y_data (DataFrame) – target data

  • cluster_info (dict) – descriptor cluster information

  • component (int) – number of features to select

  • model (str) – learning algorithm to use (default: "regression")

  • learning (int) – number of iterations to perform (default: 50000)

  • bank (int) – number of models to keep in the bank (default: 200)

  • interval (int) – number of iterations to perform before printing the current time (default: 1000)

Returns:

  • best_model (list) – best model found

  • best_score (float) – best score found