geogenie package

Subpackages

Submodules

geogenie.cli module

geogenie.cli.main()[source]

geogenie.geogenie module

class geogenie.geogenie.GeoGenIE(args)[source]

Bases: object

A class designed for predicting geographic localities from genomic SNP (Single Nucleotide Polymorphism) data using neural network and gradient boosting decision tree models.

GeoGenIE integrates genomic data analysis with geographic prediction, supporting studies in fields such as population genetics and molecular ecology.

args

Command line arguments containing configurations for data processing, model training, and visualization.

Type:

argparse.Namespace

genotypes

Genomic SNP data.

Type:

np.ndarray

samples

Sample IDs.

Type:

np.ndarray

sample_data

Dictionary containing sample data.

Type:

dict

locs

Geographic coordinates of the samples.

Type:

np.ndarray

dtype

Data type for PyTorch tensors.

Type:

torch.dtype

logger

Logger object for logging messages.

Type:

logging.Logger

device

Device to run the computations on (CPU or GPU).

Type:

str

plotting

PlotGenIE object for generating plots.

Type:

PlotGenIE

boot

Bootstrap object for bootstrapping predictions.

Type:

Bootstrap

Notes

  • This class is particularly useful in the fields of population genomics, evolutionary biology, and molecular ecology, where geographic predictions based on genomic data are crucial.

  • It requires genomic SNP data as input and uses neural network models, or optionally gradient boosting decision tree models, to make geographic predictions.

  • The class supports data preprocessing, model training, and visualization of the results.

  • It also provides support for bootstrapping predictions and optimizing hyperparameters using Optuna.

  • The class is designed to handle large-scale genomic datasets and perform geographic predictions efficiently.

_batch_init(model, batch)[source]

Initializes the batch for training or validation.

Parameters:
  • model (torch.nn.Module) – The PyTorch model to train or validate.

  • batch (tuple) – A tuple containing data, targets, and sample weights.

Returns:

A tuple containing data, targets, and sample weights as tensors moved to the appropriate device.

Return type:

tuple

_create_metrics_dictionary(values)[source]

Creates a dictionary for metrics from a list of values.

Parameters:

values (list) – List of values in the specified order, with ‘percentiles’ being a NumPy array at index 16.

Returns:

Dictionary with metrics.

Return type:

dict
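
The mapping from an ordered list to a metrics dictionary can be sketched as below. This is only an illustration: the key names are hypothetical, the real method defines its own fixed order (with ‘percentiles’ at index 16 as a NumPy array), and a plain list stands in for that array here.

```python
# Hypothetical metric names; the real method uses its own, longer list.
METRIC_KEYS = ["mean_dist", "median_dist", "std_dist", "percentiles"]


def create_metrics_dictionary(values, keys=METRIC_KEYS):
    """Pair an ordered list of values with metric names."""
    if len(values) != len(keys):
        raise ValueError(f"expected {len(keys)} values, got {len(values)}")
    # zip() relies on the caller supplying values in the agreed order.
    return dict(zip(keys, values))


metrics = create_metrics_dictionary([12.5, 9.8, 4.1, [3.2, 9.8, 21.0]])
```

Because the pairing is purely positional, a length check like the one above is a cheap guard against silently mislabelled metrics.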

calculate_prediction_metrics(outfile, predictions, ground_truth, log_stats, bootstrap, is_train=False, dataset=None)[source]

Calculate metrics for the predictions and ground truth.

Parameters:
  • outfile (str) – The path to the file where the metrics will be saved.

  • predictions (np.ndarray) – Predictions from the model.

  • ground_truth (np.ndarray) – Ground truth values.

  • log_stats (bool) – If True, log the prediction metrics.

  • bootstrap (bool) – If True, perform bootstrap predictions.

  • is_train (bool, optional) – If True, indicates that the data is training data. Defaults to False.

  • dataset (str, optional) – The dataset being used. Defaults to None.

Returns:

A tuple containing the predictions, ground truth, and metrics dictionary.

Return type:

tuple

compute_rolling_statistics(times, window_size)[source]

Computes rolling average and standard deviation over a specified window size.

Parameters:
  • times (list or array-like) – A sequence of numerical values (e.g., times or scores).

  • window_size (int) – The number of elements to consider for each rolling window.

Returns:

A tuple containing two lists:
  • averages (list): The rolling averages.

  • std_devs (list): The rolling standard deviations.

Return type:

tuple

Notes

  • This method is useful for analyzing time series data where you need to smooth out short-term fluctuations and highlight longer-term trends or cycles.
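
A minimal standard-library sketch of rolling statistics follows. It assumes that only full windows are reported and that the population standard deviation is used; the documented method may handle window edges and the deviation estimator differently.

```python
import statistics


def compute_rolling_statistics(times, window_size):
    """Return rolling means and standard deviations over full windows only."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    averages, std_devs = [], []
    for end in range(window_size, len(times) + 1):
        window = times[end - window_size:end]
        averages.append(statistics.fmean(window))
        # Assumption: population standard deviation, which is also
        # defined for a window of a single element.
        std_devs.append(statistics.pstdev(window))
    return averages, std_devs


averages, std_devs = compute_rolling_statistics([1.0, 2.0, 3.0, 4.0], window_size=2)
```

With four values and a window of two, the sketch produces three windows, so both returned lists have three entries.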

evaluate_and_save_results(model, train_losses, val_losses, dataset='val', centroids=None, use_rf=False)[source]

Evaluate the model and save the results.

Parameters:
  • model (torch.nn.Module) – The trained model to evaluate.

  • train_losses (list) – List of training losses.

  • val_losses (list) – List of validation losses.

  • dataset (str, optional) – The dataset to use for evaluation (‘val’ or ‘test’). Defaults to ‘val’.

  • centroids (np.ndarray, optional) – Centroids if using synthetic resampling. Defaults to None.

  • use_rf (bool, optional) – If True, use RandomForest model. Defaults to False.

extract_best_params(best_params, model)[source]

Extracts the best parameters for the optimizer.

Parameters:
  • best_params (dict) – Dictionary of best parameters.

  • model (torch.nn.Module) – The PyTorch model to train.

Returns:

Optimizer for the model.

Return type:

torch.optim.Optimizer

get_all_stats(predictions, ground_truth, mad, coefficient_of_variation, within_threshold)[source]

Computes various statistics for predictions and ground truth.

Parameters:
  • predictions (np.ndarray) – Predictions from the model.

  • ground_truth (np.ndarray) – Ground truth values.

  • mad (callable) – Function to compute median absolute deviation.

  • coefficient_of_variation (callable) – Function to compute coefficient of variation.

  • within_threshold (callable) – Function to compute percentage within a threshold.

Returns:

A tuple containing z-scores, a list of statistical values, and haversine errors.

Return type:

tuple
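
The haversine errors and the three callables named above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not GeoGenIE's own code: the function bodies, the (longitude, latitude) column order, the kilometre units, and the 50 km threshold are all chosen here for illustration.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mad(errors):
    """Median absolute deviation from the median."""
    med = statistics.median(errors)
    return statistics.median([abs(e - med) for e in errors])


def coefficient_of_variation(errors):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(errors) / statistics.fmean(errors)


def within_threshold(errors, threshold_km):
    """Percentage of errors at or below the threshold."""
    return 100.0 * sum(e <= threshold_km for e in errors) / len(errors)


# Assumed column order: (longitude, latitude).
predictions = [(0.0, 0.0), (0.0, 1.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0)]
errors = [haversine_km(*p, *t) for p, t in zip(predictions, ground_truth)]
pct_within_50km = within_threshold(errors, 50.0)
```

One degree of latitude is roughly 111 km, so the second sample falls outside the 50 km threshold while the exact match falls inside it.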

get_correlation_coef(predictions, ground_truth, corr_func)[source]

Computes correlation coefficients for predictions and ground truth.

Parameters:
  • predictions (np.ndarray) – Predictions from the model.

  • ground_truth (np.ndarray) – Ground truth values.

  • corr_func (callable) – Function to compute correlation coefficients.

Returns:

A tuple containing correlation coefficients and p-values for both longitude and latitude.

Return type:

tuple

load_best_params(filename)[source]

Loads the best parameters from a file.

Parameters:

filename (str) – The path to the file containing the best parameters.

Returns:

Dictionary of best parameters.

Return type:

dict

Raises:
  • FileNotFoundError – If the file does not exist.

  • TypeError – If the best parameters are not in the expected format.
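
The loading and error behaviour described above might look like the following sketch. The JSON file format and the hyperparameter names are assumptions made for illustration; the documentation does not state how the best parameters are stored.

```python
import json
import tempfile
from pathlib import Path


def load_best_params(filename):
    """Load a parameter dictionary, assuming the file is JSON."""
    path = Path(filename)
    if not path.is_file():
        raise FileNotFoundError(f"Best-parameters file not found: {path}")
    with path.open(encoding="utf-8") as fh:
        params = json.load(fh)
    if not isinstance(params, dict):
        raise TypeError(
            f"Expected a mapping of parameters, got {type(params).__name__}"
        )
    return params


# Usage with a temporary file and hypothetical hyperparameter names.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "best_params.json"
    good.write_text(json.dumps({"lr": 0.001, "width": 256}), encoding="utf-8")
    loaded = load_best_params(good)
```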

load_data()[source]

Loads genotypes from VCF file using pysam, then preprocesses the data by imputing, embedding, and transforming the input data.

make_unseen_predictions(model, device, use_rf=False, col_indices=None, boot_rep=None)[source]

Makes predictions on unseen data.

Parameters:
  • model (torch.nn.Module) – The trained model to use for predictions.

  • device (torch.device) – Device to run the model on (‘cpu’ or ‘cuda’).

  • use_rf (bool, optional) – If True, use RandomForest model. Defaults to False.

  • col_indices (list, optional) – List of column indices to use for predictions. Defaults to None.

  • boot_rep (int, optional) – Bootstrap repetition index. Defaults to None.

Returns:

A DataFrame containing the predicted locations if boot_rep is None; otherwise a tuple containing the predicted locations and the output file path.

Return type:

pandas.DataFrame or tuple

optimize_parameters(ModelClass)[source]

Perform parameter optimization using Optuna.

Parameters:

ModelClass (torch.nn.Module) – The PyTorch model class for which the optimization is to be done.

Returns:

Best parameters found by Optuna optimization.

Return type:

dict

perform_bootstrap_training(ModelClass, best_params)[source]

Perform bootstrap training using the provided parameters.

Parameters:
  • ModelClass (torch.nn.Module) – The PyTorch model class to use.

  • best_params (dict) – Dictionary of best parameters found by Optuna or specified by the user.
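
For readers unfamiliar with bootstrapping, the generic idea is to resample with replacement and repeat training on each replicate. The sketch below resamples genotype columns; whether GeoGenIE resamples columns or samples, and every helper name used here, are assumptions for illustration only.

```python
import random


def bootstrap_column_indices(n_columns, rng):
    """Draw n_columns column indices with replacement (one replicate)."""
    return [rng.randrange(n_columns) for _ in range(n_columns)]


def resample_columns(matrix, col_indices):
    """Build a new matrix from the chosen columns, keeping row order."""
    return [[row[j] for j in col_indices] for row in matrix]


# Toy genotype matrix: 3 samples x 4 SNP columns.
genotypes = [[0, 1, 2, 0], [2, 2, 1, 0], [1, 0, 0, 2]]
rng = random.Random(42)  # seeded for reproducibility

replicates = []
for boot_rep in range(3):
    cols = bootstrap_column_indices(len(genotypes[0]), rng)
    # A model would be trained on each resampled matrix at this point.
    replicates.append((cols, resample_columns(genotypes, cols)))
```

Keeping the drawn indices alongside each replicate makes it possible to apply the same column selection to unseen data later.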

perform_standard_training(train_loader, val_loader, device, best_params, ModelClass)[source]

Perform standard model training.

Parameters:
  • train_loader (torch.utils.data.DataLoader) – DataLoader for the training set.

  • val_loader (torch.utils.data.DataLoader) – DataLoader for the validation set.

  • device (torch.device) – Device for training (‘cpu’ or ‘cuda’).

  • best_params (dict) – Dictionary of parameters to use with model training.

  • ModelClass (torch.nn.Module) – The PyTorch model class (a torch.nn.Module subclass) to instantiate and train.

Returns:

A tuple containing the best model, training losses, validation losses, and centroids if any.

Return type:

tuple

plot_bootstrap_aggregates(df, train_times)[source]

Plots bootstrap aggregates and training times.

Parameters:
  • df (pandas.DataFrame) – DataFrame containing bootstrap aggregates.

  • train_times (list) – List of training times.

predict_locations(model, data_loader, outfile, return_truths=False, use_rf=False, log_metrics=True, bootstrap=False, is_train=False, dataset=None, is_val=True)[source]

Predict locations using the trained model and evaluate predictions.

Parameters:
  • model (torch.nn.Module) – The trained model to use for predictions.

  • data_loader (torch.utils.data.DataLoader) – DataLoader for the prediction data.

  • outfile (str) – The path to the file where the predictions will be saved.

  • return_truths (bool, optional) – If True, return the ground truth values along with predictions. Defaults to False.

  • use_rf (bool, optional) – If True, use RandomForest model. Defaults to False.

  • log_metrics (bool, optional) – If True, log the prediction metrics. Defaults to True.

  • bootstrap (bool, optional) – If True, perform bootstrap predictions. Defaults to False.

  • is_train (bool, optional) – If True, indicates that the data is training data. Defaults to False.

  • dataset (str, optional) – The dataset being used. Defaults to None.

  • is_val (bool, optional) – If True, indicates that the data is validation data. Defaults to True.

Returns:

Predictions from the model. If return_truths is True, a dict of metrics related to the predictions and a numpy.ndarray of ground truth values are also returned.

Return type:

numpy.ndarray, plus dict and numpy.ndarray when return_truths is True

print_stats_to_logger(metrics)[source]

Logs the computed metrics to the logger.

Parameters:

metrics (dict) – Dictionary of metrics to log.

save_model(model, filename)[source]

Saves the trained model to a file.

Parameters:
  • model (torch.nn.Module) – The trained PyTorch model to save.

  • filename (str) – The path to the file where the model will be saved.

test_step(val_loader, model)[source]

Performs a single validation step.

Parameters:
  • val_loader (torch.utils.data.DataLoader) – DataLoader for validation data.

  • model (torch.nn.Module) – The PyTorch model to validate.

Returns:

The average validation loss.

Return type:

float

total_execution_time_decorator(func)[source]

Decorator to time the execution of a function and print the duration in Hours:Minutes:Seconds format. Logs the execution time using the class’s logger.

Parameters:

func (callable) – The function to be timed.

Returns:

The wrapped function with execution time measurement.

Return type:

callable
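
A timing decorator of this kind can be sketched as follows. The documented version is a method that uses the class’s logger; this standalone version assumes a module-level logger, and the helper name format_hms is hypothetical.

```python
import functools
import logging
import time

logger = logging.getLogger("timing_sketch")  # stand-in for the class logger


def format_hms(seconds):
    """Format a duration in seconds as HH:MM:SS."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def total_execution_time(func):
    """Log how long the wrapped function took, even if it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            logger.info("%s finished in %s", func.__name__, format_hms(elapsed))
    return wrapper


@total_execution_time
def add(a, b):
    return a + b
```

Using functools.wraps keeps the wrapped function's name and docstring intact, which matters for generated documentation and log messages.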

train_model(train_loader, val_loader, model, optimizer, trial=None, objective_mode=False, do_bootstrap=False, early_stopping=None, lr_scheduler=None)[source]

Trains a given model using specified parameters and data loaders.

This method supports early stopping and learning rate scheduling, and evaluates the model’s performance on the validation dataset.

Parameters:
  • train_loader (torch.utils.data.DataLoader) – DataLoader for training data.

  • val_loader (torch.utils.data.DataLoader) – DataLoader for validation data.

  • model (torch.nn.Module) – The PyTorch model to train.

  • optimizer (torch.optim.Optimizer) – The optimizer for training the model.

  • trial (optuna.trial.Trial, optional) – Optuna trial object for hyperparameter optimization. Defaults to None.

  • objective_mode (bool, optional) – If True, the method is used for optimization objectives. Defaults to False.

  • do_bootstrap (bool, optional) – If True, performs bootstrap training. Defaults to False.

  • early_stopping (EarlyStopping, optional) – Early stopping callback. Defaults to None.

  • lr_scheduler (torch.optim.lr_scheduler, optional) – Learning rate scheduler. Defaults to None.

Returns:

Depending on the mode, returns trained model and additional training information or None if training failed.

Return type:

tuple or None
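
Early stopping, as mentioned above, halts training once the validation loss stops improving. The class below is a hypothetical stand-in with an assumed interface (patience, min_delta, and a call that takes the latest validation loss); the EarlyStopping callback expected by train_model may differ.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0
        self.early_stop = False

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the patience counter.
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True


# Simulated validation losses, one per epoch.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]
stopper = EarlyStopping(patience=3)
epochs_run = 0
for loss in val_losses:
    epochs_run += 1
    stopper(loss)
    if stopper.early_stop:
        break
```

In this run the loss last improves at epoch 3, so training stops after three further epochs without improvement and never sees the final value.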

train_rf(clf_params, objective_mode=False)[source]

Trains a regression model (RandomForestRegressor or XGBRegressor, depending on the configuration) using the specified parameters.

The method supports data augmentation using SMOTE (Synthetic Minority Over-sampling Technique) and evaluates the model’s performance using Root Mean Squared Error (RMSE).

Parameters:
  • clf_params (dict) – Parameters for Random Forest or Gradient Boosting models.

  • objective_mode (bool, optional) – If True, the method is used for optimization objectives. Defaults to False.

Returns:

The trained model and its RMSE (float) on the validation set. When objective_mode is False, additional data related to model training and evaluation is also returned.

Return type:

RandomForestRegressor or XGBRegressor

Notes

  • The function first checks if SMOTE is to be applied and performs data augmentation accordingly.

  • Depending on the configuration, either a Random Forest or a Gradient Boosting model is trained.

  • The performance of the trained model is evaluated using RMSE on the validation dataset.
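
RMSE for two-dimensional coordinate predictions can be computed as in this sketch. Averaging the squared errors uniformly over all coordinates is an assumption; the documented method may aggregate the two outputs differently.

```python
import math


def rmse(predictions, ground_truth):
    """Root mean squared error over all coordinates of paired (x, y) points."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    squared = [
        (p - t) ** 2
        for pred, true in zip(predictions, ground_truth)
        for p, t in zip(pred, true)
    ]
    return math.sqrt(sum(squared) / len(squared))


preds = [(1.0, 2.0), (3.0, 4.0)]
truth = [(1.0, 0.0), (0.0, 4.0)]
score = rmse(preds, truth)
```

Here the squared errors are 0, 4, 9, and 0, so the score is the square root of 13/4.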

train_step(train_loader, model, optimizer, grad_clip, objective_mode)[source]

Performs a single training step.

Parameters:
  • train_loader (torch.utils.data.DataLoader) – DataLoader for training data.

  • model (torch.nn.Module) – The PyTorch model to train.

  • optimizer (torch.optim.Optimizer) – The optimizer for training the model.

  • grad_clip (float) – Value for gradient clipping.

  • objective_mode (bool) – If True, the method is used for optimization objectives.

Returns:

The average training loss or None if training failed.

Return type:

float or None

train_test_predict()[source]

Perform the complete training, testing, and prediction pipeline.

This method sets the seed, initializes the device, loads the data, performs parameter optimization, trains the model, evaluates and saves results, and makes predictions on unseen data.

visualize_oversampling(features, labels, sample_weights, df, bins_resampled)[source]

Visualizes the effect of SMOTE (Synthetic Minority Over-sampling Technique) on the dataset.

This method creates a visual comparison of the original and the oversampled datasets.

Parameters:
  • features (np.ndarray or pandas.DataFrame) – The feature set, either as a NumPy array or a DataFrame.

  • labels (np.ndarray or pandas.DataFrame) – The label set, expected to contain ‘x’ and ‘y’ coordinates.

  • sample_weights (np.ndarray) – Array of sample weights.

  • df (pandas.DataFrame) – Original DataFrame before applying SMOTE.

  • bins_resampled (array-like) – Array of bin labels for the data after applying SMOTE.

Notes

  • The method first validates and converts the features, labels, and sample weights to pandas DataFrames.

  • It then combines these into a single DataFrame and passes this to the plot_smote_bins method of the PlotGenIE class.

  • This visualization helps in understanding how SMOTE affects the distribution of samples across different geographical bins.

write_pred_locations(pred_locations, filename, sample_ids)[source]

Writes predicted locations to a file.

Parameters:
  • pred_locations (np.ndarray) – Array of predicted locations. Expects shape (n_samples, 2).

  • filename (str) – The path to the file where predictions will be saved.

  • sample_ids (np.ndarray) – Array of sample IDs.

Returns:

DataFrame containing the predicted locations.

Return type:

pandas.DataFrame

geogenie.geogenie.save_execution_times(filename)[source]

Appends the execution times to a CSV file. If the file doesn’t exist, it creates one.

Parameters:

filename (str) – The name of the file where data will be saved.

geogenie.geogenie.timer(func)[source]

Decorator that measures and stores the execution time of a function.

Parameters:

func (Callable) – The function to be wrapped by the timer.

Returns:

The wrapped function with timing functionality.

Return type:

Callable
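
The two module-level helpers, timer and save_execution_times, can be sketched together as below. The module-level list used as storage, the CSV column names, and the header handling are assumptions; the documentation only states that times are stored and appended to a CSV file that is created if missing.

```python
import csv
import functools
import os
import tempfile
import time

execution_times = []  # hypothetical store of (function name, seconds)


def timer(func):
    """Record the wall-clock duration of each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        execution_times.append((func.__name__, time.perf_counter() - start))
        return result
    return wrapper


def save_execution_times(filename):
    """Append recorded times to a CSV file, creating it with a header if needed."""
    write_header = not os.path.exists(filename)
    with open(filename, "a", newline="") as fh:
        writer = csv.writer(fh)
        if write_header:
            writer.writerow(["function", "seconds"])
        writer.writerows(execution_times)


@timer
def square(x):
    return x * x


result = square(4)
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "execution_times.csv")
    save_execution_times(out)  # creates the file and writes the header
    save_execution_times(out)  # appends without repeating the header
    with open(out, newline="") as fh:
        rows = list(csv.reader(fh))
```

The store is not cleared between saves in this sketch, so calling save_execution_times twice appends the same record twice.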

Module contents