qsarify.data_tools module

Data Preprocessing Module

This module contains functions for data preprocessing, including:
  • removing features with ‘NaN’ as value

  • removing features with constant values

  • removing features with low variance

  • removing features whose correlation coefficients evaluate to ‘NaN’

  • generating a sequential train-test split by sorting the data by response variable

  • generating a random train-test split

  • scaling data

The main function of this module is clean_data, which runs all of the above steps in a single call.

qsarify.data_tools.clean_data(X_data, y_data, split='sorted', test_size=0.2, cutoff=None, plot=False)

Perform the entire data cleaning process in a single call. Optionally, plot the correlation matrix.

Parameters:
  • X_data (pandas DataFrame, shape = (n_samples, n_features)) –

  • y_data (pandas DataFrame, shape = (n_samples, 1)) –

  • split (string, optional, 'sorted' or 'random', default = 'sorted') –

  • test_size (float, optional, default = 0.2) –

  • cutoff (float, optional, default = None) – autocorrelation coefficient below which features are kept

  • plot (boolean, optional, default = False) – whether to plot the correlation matrix

Returns:

  • X_train (pandas DataFrame, shape = (n_samples, m_features))

  • X_test (pandas DataFrame, shape = (p_samples, m_features))

  • y_train (pandas DataFrame, shape = (n_samples, 1))

  • y_test (pandas DataFrame, shape = (p_samples, 1))
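
The cutoff parameter is documented only as a correlation threshold, so the following is a minimal sketch of one plausible reading, not qsarify's own implementation: a feature is dropped when its absolute Pearson correlation with an earlier column exceeds the cutoff. The helper name drop_correlated is hypothetical.

```python
import numpy as np
import pandas as pd


def drop_correlated(X_data: pd.DataFrame, cutoff: float) -> pd.DataFrame:
    """Hypothetical helper: drop features too strongly correlated with an earlier one."""
    corr = X_data.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it exceeds the cutoff against any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return X_data.drop(columns=to_drop)


df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [5, 1, 4, 2, 3],
})
print(list(drop_correlated(df, cutoff=0.9).columns))  # ['a', 'c']
```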

qsarify.data_tools.random_split(X_data, y_data, test_size=0.2)

Generate a random train-test split

Parameters:
  • X_data (pandas DataFrame, shape = (n_samples, m_features)) –

  • y_data (pandas DataFrame, shape = (n_samples, 1)) –

  • test_size (float, default = 0.2) –

Returns:

  • X_train (pandas DataFrame, shape = (n_samples, m_features))

  • X_test (pandas DataFrame, shape = (p_samples, m_features))

  • y_train (pandas DataFrame, shape = (n_samples, 1))

  • y_test (pandas DataFrame, shape = (p_samples, 1))
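
As an illustration of what a random split does, here is a minimal sketch using a NumPy permutation. It is not the library's code: the function name random_split_sketch and the seed argument are assumptions added so the example is reproducible.

```python
import numpy as np
import pandas as pd


def random_split_sketch(X_data, y_data, test_size=0.2, seed=0):
    """Hypothetical stand-in: shuffle row positions, then cut off a test fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_data))
    n_test = int(round(test_size * len(X_data)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # The same positions are used for X and y so rows stay aligned.
    return (X_data.iloc[train_idx], X_data.iloc[test_idx],
            y_data.iloc[train_idx], y_data.iloc[test_idx])
```

With 10 samples and test_size=0.2, this returns 8 training rows and 2 test rows.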

qsarify.data_tools.rm_constant(X_data)

Remove features with constant values

Parameters:

X_data (pandas DataFrame, shape = (n_samples, n_features)) –

Return type:

Modified DataFrame
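
A minimal pandas sketch of constant-feature removal, assuming "constant" means a single distinct value in the column. The helper name drop_constant is hypothetical and this is not necessarily how rm_constant is implemented.

```python
import pandas as pd


def drop_constant(X_data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only columns with more than one distinct value."""
    # nunique() counts distinct non-NaN values per column.
    return X_data.loc[:, X_data.nunique() > 1]


df = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7]})
print(list(drop_constant(df).columns))  # ['a']
```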

qsarify.data_tools.rm_lowVar(X_data, cutoff=0.9)

Remove features with low variance

Parameters:
  • X_data (pandas DataFrame, shape = (n_samples, n_features)) –

  • cutoff (float, default = 0.9) –

Return type:

Modified DataFrame
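
The documentation does not say how cutoff is compared, so this sketch assumes it is a threshold on the raw sample variance of each column. The helper name drop_low_variance is hypothetical.

```python
import pandas as pd


def drop_low_variance(X_data: pd.DataFrame, cutoff: float) -> pd.DataFrame:
    """Hypothetical helper: keep columns whose sample variance is at least cutoff."""
    variances = X_data.var()  # pandas default is the sample variance (ddof=1)
    return X_data.loc[:, variances >= cutoff]


df = pd.DataFrame({"a": [0.0, 10.0, 20.0], "b": [1.0, 1.1, 0.9]})
print(list(drop_low_variance(df, cutoff=0.9).columns))  # ['a']
```

Because raw variance depends on the units of each descriptor, a threshold like this is only meaningful when the features are on comparable scales.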

qsarify.data_tools.rm_nan(X_data)

Remove features with ‘NaN’ as value

Parameters:

X_data (pandas DataFrame, shape = (n_samples, n_features)) –

Return type:

Modified DataFrame
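
In plain pandas, dropping every feature that contains at least one NaN can be sketched with DataFrame.dropna along the column axis. The helper name drop_nan_features is hypothetical, and whether rm_nan uses the same "any NaN" rule is an assumption.

```python
import pandas as pd


def drop_nan_features(X_data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop columns that contain any NaN value."""
    # axis=1 drops columns rather than rows; how="any" is the pandas default.
    return X_data.dropna(axis=1, how="any")


df = pd.DataFrame({"a": [1.0, 2.0], "b": [float("nan"), 3.0]})
print(list(drop_nan_features(df).columns))  # ['a']
```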

qsarify.data_tools.rm_nanCorr(X_data)

Remove features whose correlation coefficients evaluate to ‘NaN’

Parameters:

X_data (pandas DataFrame, shape = (n_samples, n_features)) –

Return type:

Modified DataFrame
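
A correlation coefficient is undefined (NaN) when a feature has zero variance. The sketch below shows one way to detect such features, assuming a feature should be dropped when its correlation with every other feature is NaN. The helper name drop_nan_corr is hypothetical and the rule is an assumption, not a description of rm_nanCorr's internals.

```python
import numpy as np
import pandas as pd


def drop_nan_corr(X_data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop columns with no defined correlation to any other column."""
    corr = X_data.corr()
    # Blank out the diagonal so only correlations with *other* features count.
    off_diag = corr.mask(np.eye(len(corr), dtype=bool))
    undefined = off_diag.isna().all(axis=0)
    return X_data.loc[:, ~undefined]


df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 1, 4, 3],
    "c": [5, 5, 5, 5],  # zero variance, so its correlations are NaN
})
print(list(drop_nan_corr(df).columns))  # ['a', 'b']
```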

qsarify.data_tools.scale_data(X_train, X_test)

Scale the data using the training data; apply the same transformation to the test data

Parameters:
  • X_train (pandas DataFrame, shape = (n_samples, m_features)) –

  • X_test (pandas DataFrame, shape = (p_samples, m_features)) –

Returns:

  • X_train_scaled (pandas DataFrame, shape = (n_samples, m_features))

  • X_test_scaled (pandas DataFrame, shape = (p_samples, m_features))
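
The key point of scaling on the training data is that the test set never contributes to the scaling statistics. The sketch below illustrates this with z-score standardization. The choice of scaler is an assumption (the documentation does not name one), and scale_with_train_stats is a hypothetical name.

```python
import pandas as pd


def scale_with_train_stats(X_train: pd.DataFrame, X_test: pd.DataFrame):
    """Hypothetical helper: standardize both sets with the training mean and std."""
    mean = X_train.mean()
    std = X_train.std(ddof=0)  # population std; assumes no constant columns remain
    # The test set reuses the training statistics, so no information leaks from it.
    return (X_train - mean) / std, (X_test - mean) / std


X_train = pd.DataFrame({"a": [0.0, 2.0, 4.0]})
X_test = pd.DataFrame({"a": [2.0]})
X_train_scaled, X_test_scaled = scale_with_train_stats(X_train, X_test)
```

Here the single test value equals the training mean, so it scales to 0.0.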

qsarify.data_tools.sorted_split(X_data, y_data, test_size=0.2)

Generate a sequential train-test split by sorting the data by response variable

Parameters:
  • X_data (pandas DataFrame, shape = (n_samples, m_features)) –

  • y_data (pandas DataFrame, shape = (n_samples, 1)) –

  • test_size (float, default = 0.2) –

Returns:

  • X_train (pandas DataFrame, shape = (n_samples, m_features))

  • X_test (pandas DataFrame, shape = (p_samples, m_features))

  • y_train (pandas DataFrame, shape = (n_samples, 1))

  • y_test (pandas DataFrame, shape = (p_samples, 1))
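
One common way to build a sequential split is to sort samples by the response and send every k-th sample to the test set, so both sets cover the full response range. The sketch below assumes that scheme with k = round(1 / test_size). The library's exact selection rule is not documented here, and sorted_split_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def sorted_split_sketch(X_data, y_data, test_size=0.2):
    """Hypothetical stand-in: sort by response, then take every k-th row as test."""
    # Row positions ordered by ascending response value.
    order = np.argsort(y_data.iloc[:, 0].to_numpy(), kind="stable")
    step = int(round(1 / test_size))
    # The last row of each block of `step` sorted rows goes to the test set.
    is_test = np.arange(len(order)) % step == step - 1
    train_idx, test_idx = order[~is_test], order[is_test]
    return (X_data.iloc[train_idx], X_data.iloc[test_idx],
            y_data.iloc[train_idx], y_data.iloc[test_idx])
```

With 10 samples and test_size=0.2, the 5th and 10th samples in ascending response order form the test set.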