qsarify.data_tools module
Data Preprocessing Module
- This module contains functions for data preprocessing, including:
removing features with ‘NaN’ as value
removing features with constant values
removing features with low variance
removing features with ‘NaN’ as value when calculating correlation coefficients
generating a sequential train-test split by sorting the data by response variable
generating a random train-test split
scaling data
The main function of this module is clean_data, which runs the steps above in a single call, using either the sorted or the random split.
- qsarify.data_tools.clean_data(X_data, y_data, split='sorted', test_size=0.2, cutoff=None, plot=False)
Perform the entire data cleaning process in one function call. Optionally, plot the correlation matrix.
- Parameters:
X_data (pandas DataFrame, shape = (n_samples, n_features)) –
y_data (pandas DataFrame, shape = (n_samples, 1)) –
split (string, optional, 'sorted' or 'random', default = 'sorted') –
test_size (float, optional, default = 0.2) –
cutoff (float, optional, default = None; autocorrelation coefficient below which features are kept) –
plot (boolean, optional, default = False) –
- Returns:
X_train (pandas DataFrame , shape = (n_samples, m_features))
X_test (pandas DataFrame , shape = (p_samples, m_features))
y_train (pandas DataFrame , shape = (n_samples, 1))
y_test (pandas DataFrame , shape = (p_samples, 1))
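The cutoff parameter is described only as a correlation coefficient below which features are kept. A minimal sketch of one way such a filter could work is shown here; the helper name drop_correlated and the greedy keep-first rule are assumptions for illustration, not qsarify's confirmed behavior.

```python
import pandas as pd


def drop_correlated(X: pd.DataFrame, cutoff: float) -> pd.DataFrame:
    """Hypothetical helper: keep a feature only if its absolute Pearson
    correlation with every previously kept feature is below `cutoff`."""
    corr = X.corr().abs()
    kept = []
    for col in X.columns:
        # A NaN correlation compares as False, so such a column is dropped.
        if all(corr.loc[col, k] < cutoff for k in kept):
            kept.append(col)
    return X[kept]


X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with "a"
    "c": [1.0, -1.0, 1.0, -1.0],  # weakly correlated with "a"
})
X_kept = drop_correlated(X, cutoff=0.95)
print(list(X_kept.columns))
```

In this sketch the order of the columns decides which member of a correlated pair survives; a real implementation may rank features differently.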
- qsarify.data_tools.random_split(X_data, y_data, test_size=0.2)
Generate a random train-test split
- Parameters:
X_data (pandas DataFrame , shape = (n_samples, m_features)) –
y_data (pandas DataFrame , shape = (n_samples, 1)) –
test_size (float, default = 0.2) –
- Returns:
X_train (pandas DataFrame , shape = (n_samples, m_features))
X_test (pandas DataFrame , shape = (p_samples, m_features))
y_train (pandas DataFrame , shape = (n_samples, 1))
y_test (pandas DataFrame , shape = (p_samples, 1))
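A random split of this kind can be sketched with a shuffled row order. The function name random_split_sketch and the seed argument are hypothetical and added only to make the example reproducible; the source does not say how qsarify draws its random sample.

```python
import numpy as np
import pandas as pd


def random_split_sketch(X, y, test_size=0.2, seed=0):
    """Illustrative random split; `seed` is an assumed extra argument."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))            # shuffled row positions
    n_test = int(round(test_size * len(X)))    # number of test rows
    test_pos, train_pos = order[:n_test], order[n_test:]
    return (X.iloc[train_pos], X.iloc[test_pos],
            y.iloc[train_pos], y.iloc[test_pos])


X = pd.DataFrame({"f1": range(10), "f2": range(10, 20)})
y = pd.DataFrame({"y": range(10)})
X_train, X_test, y_train, y_test = random_split_sketch(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)
```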
- qsarify.data_tools.rm_constant(X_data)
Remove features with constant values
- Parameters:
X_data (pandas DataFrame , shape = (n_samples, n_features)) –
- Return type:
Modified DataFrame
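As a rough illustration of constant-feature removal in plain pandas (drop_constant is a hypothetical name, and this may differ from what rm_constant does internally), a column can be treated as constant when it holds a single distinct value:

```python
import pandas as pd


def drop_constant(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop columns with only one distinct value."""
    # nunique counts distinct values per column; NaN is counted as a value here.
    return X.loc[:, X.nunique(dropna=False) > 1]


X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [7.0, 7.0, 7.0],  # constant, carries no information
    "c": [0.1, 0.1, 0.2],
})
X_no_const = drop_constant(X)
print(list(X_no_const.columns))
```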
- qsarify.data_tools.rm_lowVar(X_data, cutoff=0.9)
Remove features with low variance
- Parameters:
X_data (pandas DataFrame , shape = (n_samples, n_features)) –
cutoff (float, default = 0.9) –
- Return type:
Modified DataFrame
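The source does not state how cutoff is converted into a variance threshold, so the sketch below does not attempt to reproduce rm_lowVar. It only shows the general idea with a hypothetical helper, drop_low_variance, whose threshold argument is compared directly against each column's sample variance.

```python
import pandas as pd


def drop_low_variance(X: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Hypothetical helper: keep columns whose sample variance exceeds
    `threshold` (an assumed, direct variance threshold)."""
    return X.loc[:, X.var() > threshold]


X = pd.DataFrame({
    "a": [0.0, 0.0, 0.0, 1.0],  # sample variance 0.25
    "b": [0.0, 1.0, 0.0, 1.0],  # sample variance 1/3
})
X_var = drop_low_variance(X, threshold=0.3)
print(list(X_var.columns))
```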
- qsarify.data_tools.rm_nan(X_data)
Remove features with ‘NaN’ as value
- Parameters:
X_data (pandas DataFrame , shape = (n_samples, n_features)) –
- Return type:
Modified DataFrame
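For orientation, the same kind of cleanup can be expressed with standard pandas calls. This is a sketch of the idea, assuming that a feature is dropped as soon as it contains a single NaN; it is not taken from the rm_nan source.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],  # one missing value
    "c": [0.5, 0.1, 0.9],
})

# Count missing values per feature before dropping anything.
nan_counts = X.isna().sum()

# Drop every column (axis=1) that contains at least one NaN.
X_clean = X.dropna(axis=1, how="any")
print(nan_counts.to_dict(), list(X_clean.columns))
```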
- qsarify.data_tools.rm_nanCorr(X_data)
Remove features with ‘NaN’ as value when calculating correlation coefficients
- Parameters:
X_data (pandas DataFrame , shape = (n_samples, n_features)) –
- Return type:
Modified DataFrame
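A feature typically yields NaN correlation coefficients when it has zero variance, because the correlation formula divides by its standard deviation. The sketch below uses a hypothetical helper, drop_nan_corr, that drops features whose whole column in the correlation matrix is NaN; whether rm_nanCorr uses exactly this criterion is an assumption.

```python
import pandas as pd


def drop_nan_corr(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop features whose correlation with every
    feature (including itself) is NaN."""
    corr = X.corr()
    bad = corr.columns[corr.isna().all(axis=0)]
    return X.drop(columns=bad)


X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [5.0, 5.0, 5.0, 5.0],  # zero variance, so its correlations are NaN
    "c": [4.0, 1.0, 3.0, 2.0],
})
X_corr_ok = drop_nan_corr(X)
print(list(X_corr_ok.columns))
```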
- qsarify.data_tools.scale_data(X_train, X_test)
Scale the data using the training data; apply the same transformation to the test data
- Parameters:
X_train (pandas DataFrame , shape = (n_samples, m_features)) –
X_test (pandas DataFrame , shape = (p_samples, m_features)) –
- Returns:
X_train_scaled (pandas DataFrame , shape = (n_samples, m_features))
X_test_scaled (pandas DataFrame , shape = (p_samples, m_features))
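The key point is that the scaling statistics come from the training data only and are then reused for the test data. The source does not say which scaler is used, so the sketch below assumes z-score standardization; scale_sketch is a hypothetical name.

```python
import pandas as pd


def scale_sketch(X_train: pd.DataFrame, X_test: pd.DataFrame):
    """Illustrative scaling: fit mean and standard deviation on the
    training set, then apply the same transform to both sets.
    Assumes constant features were removed, so no std is zero."""
    mean = X_train.mean()
    std = X_train.std(ddof=0)  # population standard deviation
    return (X_train - mean) / std, (X_test - mean) / std


X_train = pd.DataFrame({"a": [0.0, 2.0, 4.0]})
X_test = pd.DataFrame({"a": [2.0, 6.0]})
X_train_scaled, X_test_scaled = scale_sketch(X_train, X_test)
print(X_test_scaled["a"].tolist())
```

Because the test set is transformed with training statistics, its scaled values need not have zero mean or unit variance.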
- qsarify.data_tools.sorted_split(X_data, y_data, test_size=0.2)
Generate a sequential train-test split by sorting the data by response variable
- Parameters:
X_data (pandas DataFrame , shape = (n_samples, m_features)) –
y_data (pandas DataFrame , shape = (n_samples, 1)) –
test_size (float, default = 0.2) –
- Returns:
X_train (pandas DataFrame , shape = (n_samples, m_features))
X_test (pandas DataFrame , shape = (p_samples, m_features))
y_train (pandas DataFrame , shape = (n_samples, 1))
y_test (pandas DataFrame , shape = (p_samples, 1))
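One common way to build a sequential split is to sort the samples by the response and send every k-th sample to the test set, so that both sets span the full response range. The sketch below assumes that rule with k = round(1 / test_size); the function name sorted_split_sketch is hypothetical and the exact selection rule in qsarify may differ.

```python
import numpy as np
import pandas as pd


def sorted_split_sketch(X, y, test_size=0.2):
    """Illustrative sorted split: every k-th sample in response order
    goes to the test set, where k = round(1 / test_size)."""
    order = np.argsort(y.iloc[:, 0].to_numpy(), kind="stable")
    step = int(round(1 / test_size))
    mask = np.zeros(len(order), dtype=bool)
    mask[step - 1::step] = True             # positions k-1, 2k-1, ...
    test_pos, train_pos = order[mask], order[~mask]
    return (X.iloc[train_pos], X.iloc[test_pos],
            y.iloc[train_pos], y.iloc[test_pos])


y = pd.DataFrame({"y": [5, 3, 9, 1, 7, 0, 8, 2, 6, 4]})
X = pd.DataFrame({"f1": np.arange(10) * 1.5})
X_train, X_test, y_train, y_test = sorted_split_sketch(X, y, test_size=0.2)
print(y_test["y"].tolist())
```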