GeoGenIE Tutorial

Introduction

Welcome to the GeoGenIE tutorial! This guide will help you install and run GeoGenIE, a tool for predicting geographic provenance from genetic SNP data. GeoGenIE uses deep learning models to infer latitude and longitude origins from genetic data, addressing challenges like sampling imbalance and outliers.

GeoGenIE (Geographic-Genetic Inference Engine) is a sophisticated tool designed for predicting geographic provenance (i.e., latitude and longitude origin) from genetic SNP data using deep learning models. It incorporates advanced features such as:

Automated model parameter optimization using Optuna [ASY+20]. Optuna tunes hyperparameters to improve model performance. The parameter search is parallelized for faster optimization.
Outlier detection using the algorithm adapted from GGOutlieR [CS23].
Handling imbalanced sampling with regression-based synthetic oversampling, based on the SMOTE algorithm [CBHK02], and weighted loss functions.
Effective Visualizations and metrics for evaluating predictions.

GeoGenIE is tailored for researchers in population genetics and molecular ecology, addressing challenges like sampling imbalance and outliers while delivering robust geographic predictions.

Features at a Glance

GeoGenIE combines cutting-edge methodologies to deliver accurate and reproducible results. Key features include:

Scalable Architecture: Capable of handling large-scale SNP datasets with thousands of samples.
Geospatial Mapping: Provides informative maps of predicted versus actual sample locations.
Customizable Outputs: Supports tailored metrics and visualizations based on user-defined parameters.
Parallelized Bootstrapping: Offers built-in methods for robust model validation and uncertainty estimation. Bootstrapping is parallelized for faster execution.

Deep Learning Model Architecture

GeoGenIE’s deep learning architecture builds on Locator [BRK20], leveraging the flexibility of PyTorch. It incorporates:

Dynamic Neural Network Architecture: Configurable hidden layers with progressive reduction controlled by –factor.
Regularization Techniques: Includes dropout, batch normalization, and L2 regularization to enhance model generalizability.
Custom Loss Functions: Supports RMSE, Huber, and DRMS Loss to address specific prediction challenges.

Installation

Steps to Install GeoGenIE

Create a virtual or conda environment:

python -m venv geogenie_env
source geogenie_env/bin/activate

Install GeoGenIE:
pip install GeoGenIE

Dependencies

Dependencies like PyTorch, geopandas, scikit-learn, matplotlib, scipy, pykrige, numpy, and many others are automatically installed. Ensure Python ≥ 3.11 is used.

Usage

Running GeoGenIE

GeoGenIE can be run via:

YAML Configuration File (recommended):

geogenie --config config_files/config.yaml

Command-Line Arguments:

geogenie --vcf <vcf_file> --sample_data <coordinates_file>

Run geogenie -h to see all available arguments.

Configuration File Format

Boolean values: true or false.
None values: null.

Input Files

Argument	Description
vcf	VCF file with SNP data.
sample_data	CSV/TSV file with sample coordinates (sampleID, x, y). Unknowns as `nan`.
known_coords_file	File with known coordinates for samples.

Algorithms to Mitigate Sampling Imbalance

GeoGenIE addresses sampling imbalances with:

Outlier Detection: Removes anomalous samples (e.g., due to translocations).
Weighted Loss Function: Focuses training on underrepresented regions.
Synthetic Oversampling: Creates synthetic samples in sparsely sampled areas.

Features and Settings

Data Input and Preprocessing

Option	Description	Default
min_mac	Filters SNPs with minor allele counts below a threshold.	2
max_SNPs	Limits the number of SNPs used in analysis.	Use all SNPs

Model Configuration

Option	Description	Default
dropout_prop	Dropout rate to prevent overfitting.	0.25
n_layers	Number of hidden layers in the neural network.	10
width	Neurons per layer.	256
criterion	Loss function (rmse, huber, drms).	“rmse”

Tips:

Note

Start with default settings. Then adjust based on model performance.
Adjust dropout_prop for overfitting.
Increase n_layers for complex data.
Adjust width based on data complexity.
Use criterion to select the loss function.
Set min_mac based on data quality.
Use max_SNPs to limit the number of SNPs.
Use the known_coords_file to validate predictions. This file can be the same as the sample_data file.

Training Parameters

Option	Description	Default
max_epochs	Maximum training cycles.	5000
learning_rate	Learning rate for weight updates.	0.001
train_split	Proportion of data for training.	0.8
val_split	Proportion of data for validation set.	0.2
early_stop_patience	Epochs without improvement before stopping.	48
batch_size	Number of samples processed per iteration.	32
l2_reg	L2 regularization parameter.	0.0
factor	Factor for hidden layer size reduction.	0.5
lr_scheduler_patience \| Patience for learning rate scheduler.		10
dtype	Data type for model (float32, float64).	“float32”
grad_clip	Enable graadient clipping.	False

Note

Set max_epochs high and rely on early stopping to prevent overfitting. The default will suffice in most circumstances.
Use early_stop_patience to control how quickly early stopping occurs.
Adjust learning_rate based on model performance.
Use train_split and val_split to control training and validation data proportions. The default is 80/20. Make sure they sum to 1.0.
Set batch_size based on available memory and model performance.
Use l2_reg to prevent overfitting.
Adjust factor to control hidden layer size reduction.
Set lr_scheduler_patience to control learning rate decay.
Use dtype to set the model data type.
Enable grad_clip to prevent exploding gradients.

Outlier Detection

Option	Description	Default
detect_outliers	Enables outlier detection.	False
min_nn_dist	Minimum distance to flag outliers.	1000
significance_level	Statistical significance threshold.	0.05
maxk	Max number of neighbors for outlier test.	10

Bootstrapping

Enable bootstrapping for confidence intervals with:

geogenie --do_bootstrap --nboots 100

Confidence interval plots are saved in <output_dir>/plots/bootstrapped_sample_ci.

Hyperparameter Optimization

Optuna optimizes model hyperparameters automatically. The process is fully parallelized across multiple CPUs to speed up computation. The best parameters are saved in JSON format and can be reloaded in future analyses.

Optimize model hyperparameters with Optuna:

Option	Description	Default
do_gridsearch	Perform search for optimal hyperparameters	False
n_iter \| Number of optimization iterations.		100
sqldb	Path to SQLite database for Optuna.	None
load_best_params	Load best parameters from Optuna.	None

Plot Settings

Option \| Description		Default
show_plots \| Display plots interactively.		False
fontsize	Font size for plots.	24
filetype	File format for saved plots (e.g., png).	“png”
output_dir	Directory for saving output files.	“output”
prefix	Prefix for output files.	“geogenie”
remove_splines	Remove plot axis spines.	False
plot_dpi	Dots per inch for saved plots.	300
filetype	File format for saved plots (png, pdf, jpg)	“png”
shapefile	URL or path to shapefile for plotting.	None
basemap_fips	FIPS code for basemap.	“US”

Plot Examples

Below are examples of some of the plots generated by GeoGenIE:

Geographic Error Interpolated with Kriging — Geographic Error interpolated across the landscape with Kriging. The color scale represents Haversine distance between the predicted and recorded localities, in kilometers. Blue circles represnet the recorded sample locations.

Bootstrapped Sample CI — Confidence intervals for a representive sample among 100 bootstrap replicates. Gray circles represent the predicted sample locations, the “X” marks the centroid of the bootstrap replicates, and the black filled triangle represents the recorded sample location. The contours represent varying levels of bootstrap prediction densities.

Flagged Outliers — Detected outlier samples based on samples that deviate from expected genetic and geographic patterns. Outliers are flagged based on the minimum nearest neighbor distance and statistical significance threshold. The large orange circles represent removed outlier samples.

Regression Plot — Scatter plot of predicted versus recorded sample locations. The trend lines represents the linear and polynomial regression fit. The color scale represents the density of samples. The Pearson and Spearman correlation coefficients are displayed.

Synthetic Oversampling — Synthetic samples generated by the SMOTE-based algorithm to balance the distribution of samples across geographic space. The synthetic samples are shown as “x” marks. Real samples are shown as circles. The colors represent the samples grouped with the K-means clustering algorithm.

Sample Densities — Geographic distribution of samples in the test set. Circles represent sample locations, and the color scale and size of each point represents the inverse density of samples in the area, with larger and darker circles indicating higher inverse sample densities.

Bootstrapped Error Box plots — (Left subplot) Box plots of prediction errors for bootstrapped samples. The box represents the interquartile range, the line represents the median, and the whiskers represent the range of values. (Right subplot) Boxplots of Pearson and Spearman correlation coefficients for bootstrapped samples. Correlation coefficients are calculated between longitude and latitude of predicted and recorded sample locations.

Training History — Training and validation loss over epochs. The training loss is shown in blue, and the validation loss is shown in orange. The plot demonstrates the convergence of the model during training.

These plots provide insights into the model’s performance, error distributions, and geographic predictions. There are also additional plots not shown here available for visualizing model outputs and performance metrics.

Various aspects of the plots can be adjusted per user preference. For example, the font size, file type, and output directory can be customized to suit specific requirements, among other settings. See the config.yaml file for more options.

Output Files

Outputs are saved in <output_dir>/<subdirectory>/<prefix>_*. Directory structure:

Subdirectory	Description
plots/	Visualizations like error distributions.
bootstrapped_sample_ci/	Per-sample bootstrap CI plots
models/	Trained PyTorch models (`.pt` files).
logfiles/	Logs (INFO, WARNING, ERROR).
bootstrap_metrics/	Metrics for bootstrapped analysis
bootstrap_predictions/	Predictions for bootstrapped samples.
bootstrapped_summaries/	Aggregated summaries of bootstrapped samples.
data/	Detected outlier samples
optimize/	Optuna optimization logs
predictions/	Predictions for all unknown samples
test/	Test set predictions for model evaluation
training/	Training set metrics from model evaluation
validation/	Validation set metrics from model evaluation
benchmarking/	Benchmarking results (per bootstrap)

Metric Descriptions

GeoGenIE provides the below metrics for evaluating model performance:

Root Mean Squared Error (RMSLE): RMSE used in the loss function. Calculates the square root of the mean squared error between predicted and recorded values.
Mean Distance: Average Haversine distance between predicted and actual locations, in kilometers.
Median Distance: Median Haversine distance between predicted and actual locations, in kilometers.
Standard Deviation of Distance: Standard deviation of Haversine distances between predicted and actual locations, in kilometers.
Kolmogorov-Smirnov Statistic: Measures the similarity of two distributions.
Kolmogorov-Smirnov p-value: Statistical significance of the Kolmogorov-Smirnov test.
Skewness: Measure of asymmetry in the distribution of prediction errors.
Pearson Correlation: Measures the strength and direction of a linear relationship.
Spearman Correlation: Measures the strength and direction of a monotonic relationship.
Mean Absolute Z-Score: Average absolute z-score of prediction errors.
Percentile 25: 25th percentile of prediction errors.
Percentile 50: 50th percentile of prediction errors.
Percentile 75: 75th percentile of prediction errors.
Percent Within 20km: Percentage of predictions within 20km of actual locations.
Percent Within 50km: Percentage of predictions within 50km of actual locations.
Percent Within 75km: Percentage of predictions within 75km of actual locations.
Mean Absolute Deviation (MAD) of Haversine Distance: Average absolute deviation of Haversine distances between predicted and actual locations, in kilometers.
Coefficient of Variation: Measure of relative variability.
Interquartile Range: Range between the 25th and 75th percentiles.

Conclusion

GeoGenIE is a powerful tool for predicting geographic provenance from genetic SNP data. It offers advanced features like outlier detection, weighted loss functions, and synthetic oversampling to handle sampling imbalances. With a scalable architecture, customizable outputs, and parallelized bootstrapping, GeoGenIE delivers accurate and reproducible results. By leveraging deep learning models and Optuna hyperparameter optimization, GeoGenIE provides robust geographic predictions for researchers in population genetics and molecular ecology.

Glossary

Key terms include:

Dropout: A technique to prevent overfitting by randomly ignoring neurons.
Weighted Loss Function: Focuses on underrepresented data.
Haversine Formula: Calculates distances on a sphere using lat/long.
SMOTE: Synthetic oversampling to balance data.
Bootstrapping: Resampling technique for estimating confidence intervals.
Outlier Detection: Identifies unusual samples using statistical methods.
Deep Learning: Neural networks with multiple hidden layers.
Optuna: Hyperparameter optimization framework.
GGOutlieR: R package for identifying unusual geo-genetic patterns.
Synthetic Minority Over-sampling Technique (SMOTE): Balances imbalanced data.
Locator: Software for predicting geographic location from genetic data.
PyTorch: Deep learning library for Python.
DRMS Loss: Distance Root Mean Square loss function.
Huber Loss: Loss function for regression that combines RMSE and MAE.
Batch Normalization: Normalizes inputs to improve training.
Minor Allele Count (MAC): Number of minor alleles in a population.
Early Stopping: Stops training when model performance plateaus.
Confidence Interval: Range of values where true value is likely to fall.
Font Size: Size of text in plots.
File Type: Format for saving plots (e.g., png).
Significance Level: Threshold for statistical significance.
Output Directory: Folder for saving results.
Subdirectory: Folder within the output directory.
Prefix: Prefix for output files.
Learning Rate Rate at which model weights are updated.
Training Split: Proportion of data used for training.
Early Stop Patience: Number of epochs without improvement before stopping.
Minimum Nearest Neighbor Distance: Minimum distance to flag outliers.
Statistical Significance Threshold: Threshold for outlier detection.
Bootstrap: Resampling technique for estimating confidence intervals.
Error Distribution: Distribution of prediction errors.
Model Directory: Folder for saving trained models.
Log Directory: Folder for saving logs.
Visualizations: Graphical representations of data.
Error Metrics: Measures of model performance.
Prediction Error: Difference between predicted and actual values.
Monotonic Relationship: Relationship where one variable increases as the other decreases.
Distance Calculation: Calculation of distances between points.
Percentage Within Distance: Percentage of predictions within specified distances.
Hyperparameter Optimization: Process of tuning model parameters to improve performance.
Sampling Imbalance: Unequal distribution of data.
Anomalous Samples: Unusual or outlying data points.
Translocations: Movement of samples to different locations.
Underrepresented Regions: Areas with few samples.
Synthetic Samples: Artificially generated data points.
Error Distributions: Distribution of prediction errors.
Interactively: In a way that allows user interaction.
Saved Plots: Graphical representations of data saved to files.
Directory Structure: Organization of files and folders.
L2 Regularization: Technique to prevent overfitting by penalizing large weights.
Batch Size: Number of samples processed in each training iteration.
Epoch: One complete pass through the training data.
Training Cycle: Iteration of training epochs.
Neurons: Units in a neural network that process input data.
Hidden Layers: Layers between input and output layers in a neural network.
Overfitting: Model that performs well on training data but poorly on test data.
Loss Function: Measures the difference between predicted and actual values.
Synthetic Oversampling: Technique to balance data by generating synthetic samples.
Parallelized: Process that runs multiple tasks simultaneously.
Reproducible Results: Results that can be replicated by others.
Geographic Provenance: Origin or source of geographic data.
Deep Learning Models: Neural networks with multiple layers.
Scalable Architecture: Design that can handle large datasets.
Customizable Outputs: Outputs that can be tailored to user needs.
Parallelized Bootstrapping: Resampling technique that runs in parallel.
Robust Geographic Predictions: Accurate and reliable geographic inferences.
Population Genetics: Study of genetic variation within populations.
Molecular Ecology: Study of genetic variation in ecological contexts.
Sampling Imbalances: Unequal distribution of samples.
Outliers: Unusual or extreme data points.
Weighted Loss Functions: Loss functions that focus on specific data points.

References

[ASY+20]

T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020. doi:10.1145/3292500.3330701.

[BRK20]

C.J. Battey, P.L. Ralph, and A.D. Kern. Predicting geographic location from genetic variation with deep neural networks. eLife, 9:e54507, 2020. doi:10.7554/eLife.54507.

[CS23]

C. Chang and K. Schmid. Ggoutlier: an r package to identify and visualize unusual geo-genetic patterns. Journal of Open Source Software, 8(91):5687, 2023. doi:10.21105/joss.05687.

[CBHK02]

N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002. doi:10.5555/1622407.1622416.