qsarify.feature_selection_multi module

Multi-Processing Feature Selection Module

This module contains functions that perform feature selection guided by the output of the clustering module. It implements a genetic algorithm for feature selection using reflection.

class qsarify.feature_selection_multi.Evolution(evolve)

Bases: object

Initializes the Evolution class with the learning algorithm to be used.

evolve(cluster_info, cluster, X_data, y_data, e_mlr)

qsarify.feature_selection_multi.selection(X_data, y_data, cluster_info, model='regression', learning=500000, bank=200, component=4, interval=1000, cores=95)

Forward feature selection using cophenetically correlated data on multiple cores.

Parameters:

X_data (pandas DataFrame, shape = (n_samples, n_features)) – input feature data
y_data (pandas DataFrame, shape = (n_samples,)) – target values
cluster_info (dictionary) – cluster assignments, as returned by clustering.featureCluster.set_cluster()
model (default="regression") – type of model to fit, either "regression" or "classification"
learning (default=500000) – number of overall models to be trained
bank (default=200) – number of models to be trained in each iteration
component (default=4) – number of features to be selected
interval (optional, default=1000) – print the current scoring and selected features once every interval iterations
cores (optional, default=(mp.cpu_count()*2)-1) – number of processes to be used for multiprocessing. The default is twice the number of cores minus one, which assumes you have SMT, Hyper-Threading, or something similar. The value 95 shown in the signature above is presumably this expression as evaluated on the machine that built the documentation. If you have a large number of cores, you may want to set this to a lower number to avoid memory issues.
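As a rough illustration of the cores default described above, the sketch below evaluates the same expression and caps it at a chosen limit. The helper names default_cores and capped_cores are hypothetical and not part of qsarify; only the expression (mp.cpu_count()*2)-1 comes from the parameter description.

```python
import multiprocessing as mp


def default_cores():
    # Expression quoted in the parameter description:
    # twice the reported CPU count, minus one.
    return (mp.cpu_count() * 2) - 1


def capped_cores(limit):
    # Hypothetical helper: never exceed `limit`, never go below one process.
    return max(1, min(default_cores(), limit))


# Example: allow at most 8 worker processes on a many-core machine.
cores = capped_cores(8)
```

The resulting value could then be passed as the cores argument of selection() to keep memory use in check.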

Return type:

list containing the best feature set selected
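To make the roles of cluster_info, component, bank and learning more concrete, here is a toy sketch of a cluster-guided evolutionary search. It is not the qsarify implementation. It assumes cluster_info maps each feature name to a cluster id, and it takes a hypothetical score callable in place of fitting a real model. The function name toy_selection and the seed parameter are likewise illustrative.

```python
import random


def toy_selection(cluster_info, score, component=4, bank=20, learning=200, seed=0):
    """Toy cluster-guided evolutionary search (illustrative only)."""
    rng = random.Random(seed)

    # Group feature names by cluster id (assumed cluster_info layout).
    clusters = {}
    for feature, cluster_id in cluster_info.items():
        clusters.setdefault(cluster_id, []).append(feature)
    cluster_ids = sorted(clusters)
    if component > len(cluster_ids):
        raise ValueError("component exceeds the number of clusters")

    def random_set():
        # Pick one feature from each of `component` distinct clusters.
        chosen = rng.sample(cluster_ids, component)
        return tuple(sorted(rng.choice(clusters[c]) for c in chosen))

    def mutate(features):
        # Replace one feature with a feature from a cluster that the
        # remaining features do not already occupy.
        features = list(features)
        drop = rng.randrange(len(features))
        used = {cluster_info[f] for i, f in enumerate(features) if i != drop}
        free = [c for c in cluster_ids if c not in used]
        features[drop] = rng.choice(clusters[rng.choice(free)])
        return tuple(sorted(features))

    # Initial bank of candidate feature sets (duplicates collapsed).
    population = sorted({random_set() for _ in range(bank)})
    trained = 0
    while trained < learning:
        children = [mutate(f) for f in population]
        trained += len(children)
        # Keep the `bank` best-scoring candidates for the next iteration.
        candidates = sorted(set(population) | set(children))
        candidates.sort(key=score, reverse=True)
        population = candidates[:bank]
    return list(population[0])


# Made-up data: six features in three clusters, with arbitrary weights.
cluster_info = {"a1": 1, "a2": 1, "b1": 2, "b2": 2, "c1": 3, "c2": 3}
weights = {"a1": 1, "a2": 5, "b1": 2, "b2": 4, "c1": 3, "c2": 0}
best = toy_selection(cluster_info, lambda fs: sum(weights[f] for f in fs),
                     component=2, bank=5, learning=200)
```

Because every candidate draws its features from distinct clusters, the returned list never contains two features from the same cluster. In a real run the score would be something like the fit quality of a regression or classification model trained on those features.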