geogenie.utils package

Submodules

geogenie.utils.argument_parser module

class geogenie.utils.argument_parser.EvaluateAction(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)[source]

Bases: Action

Custom action for evaluating complex arguments as Python literal structures.
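
The idea behind such an action can be sketched with the standard library alone. The class name LiteralEvalAction and the --opts flag below are hypothetical and are not taken from geogenie; the sketch assumes the action simply passes the raw string through ast.literal_eval.

```python
import argparse
import ast


class LiteralEvalAction(argparse.Action):
    """Hypothetical stand-in: parse an option value as a Python literal."""

    def __call__(self, parser, namespace, values, option_string=None):
        try:
            parsed = ast.literal_eval(values)
        except (ValueError, SyntaxError) as exc:
            # parser.error() prints a usage message and exits.
            parser.error(f"{option_string}: could not evaluate {values!r}: {exc}")
        setattr(namespace, self.dest, parsed)


parser = argparse.ArgumentParser()
# "--opts" is an illustrative flag name, not a documented geogenie option.
parser.add_argument("--opts", action=LiteralEvalAction, default=None)
args = parser.parse_args(["--opts", "{'a': 0.25, 'b': [1, 2]}"])
print(args.opts)
```

Because ast.literal_eval only accepts literals (numbers, strings, tuples, lists, dicts, sets, booleans, None), this pattern avoids executing arbitrary code supplied on the command line.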

geogenie.utils.argument_parser.load_config(config_path)[source]

Load the YAML configuration file.

Parameters:

config_path (str) – Path to configuration file.

Returns:

Configuration arguments.

Return type:

dict

geogenie.utils.argument_parser.setup_parser(test_mode=False)[source]

Parse command-line arguments.

Returns:

Parsed command line arguments.

Return type:

argparse.Namespace

geogenie.utils.argument_parser.validate_colorscale(parser, args)[source]
geogenie.utils.argument_parser.validate_dtype(parser, args)[source]
geogenie.utils.argument_parser.validate_embeddings(parser, args)[source]
geogenie.utils.argument_parser.validate_gb_params(parser, args)[source]
geogenie.utils.argument_parser.validate_gpu_number(value)[source]

Validate the provided GPU number.

Parameters:

value (str) – The GPU number provided as a command-line argument.

Returns:

The validated GPU number.

Return type:

int

Raises:

argparse.ArgumentTypeError – If the GPU number is invalid.

geogenie.utils.argument_parser.validate_inputs(parser, args, test_mode=False)[source]
geogenie.utils.argument_parser.validate_lower_str(value)[source]
geogenie.utils.argument_parser.validate_max_neighbors(parser, args)[source]
geogenie.utils.argument_parser.validate_n_jobs(value)[source]

Validate the provided n_jobs parameter.

Parameters:

value (int) – The number of jobs to use.

Returns:

The validated n_jobs parameter.

Return type:

int

geogenie.utils.argument_parser.validate_positive_float(value)[source]

Validate that the provided value is a positive float.

geogenie.utils.argument_parser.validate_positive_int(value)[source]

Validate that the provided value is a positive integer.
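
Validators of this kind are typically passed to argparse through the type= keyword. The following is a minimal sketch under that assumption; the function name positive_int and the --batch_size flag are hypothetical, and the real geogenie validator may differ in its messages and edge cases.

```python
import argparse


def positive_int(value):
    """Hypothetical validator: accept only integers greater than zero."""
    try:
        ivalue = int(value)
    except (TypeError, ValueError):
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer.") from None
    if ivalue <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} must be greater than 0.")
    return ivalue


parser = argparse.ArgumentParser()
# "--batch_size" is an illustrative flag name only.
parser.add_argument("--batch_size", type=positive_int, default=32)
args = parser.parse_args(["--batch_size", "64"])
print(args.batch_size)
```

Raising argparse.ArgumentTypeError lets argparse report the problem as a normal usage error instead of a traceback.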

geogenie.utils.argument_parser.validate_seed(value)[source]
geogenie.utils.argument_parser.validate_significance_levels(parser, args)[source]
geogenie.utils.argument_parser.validate_smote(parser, args)[source]
geogenie.utils.argument_parser.validate_split(value)[source]
geogenie.utils.argument_parser.validate_str2list(arg)[source]
geogenie.utils.argument_parser.validate_verbosity(value)[source]
geogenie.utils.argument_parser.validate_weighted_opts(parser, args)[source]

geogenie.utils.callbacks module

class geogenie.utils.callbacks.EarlyStopping(output_dir, prefix, patience=100, verbose=0, delta=0, trial=None, boot=None)[source]

Bases: object

Early stopping PyTorch callback.

This class defines an early stopping callback for PyTorch models.

patience

Number of epochs to wait for improvement before stopping.

Type:

int

verbose

Verbosity mode.

Type:

int

delta

Minimum change to qualify as an improvement.

Type:

float

output_dir

Directory to save the outputs.

Type:

str

prefix

Prefix for the saved files.

Type:

str

counter

Counter for the number of epochs since last improvement.

Type:

int

early_stop

Flag to indicate if early stopping is triggered.

Type:

bool

best_score

Best score for the monitored quantity.

Type:

float

val_loss_min

Minimum validation loss.

Type:

float

boot

Boot object or identifier.

Type:

int

trial

Trial number for hyperparameter optimization.

Type:

int

logger

Logger object for the class.

Type:

logging.Logger

load_best_model(model)[source]

Load the best model from the checkpoint file.

Parameters:

model (torch.nn.Module) – The model to load the checkpoint into.

Returns:

The model with weights loaded from the best checkpoint.

Return type:

torch.nn.Module

Raises:

FileNotFoundError – If the checkpoint file is not found.

save_checkpoint(val_loss, model)[source]

Save the model when validation loss decreases.

Parameters:
  • val_loss (float) – Current validation loss.

  • model (torch.nn.Module) – The model being trained.
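
How patience, delta, counter and early_stop interact can be illustrated without PyTorch. The class below is a hypothetical, framework-free sketch (the name PatienceTracker and its step method are not part of geogenie); it assumes that an epoch counts as an improvement only when the loss drops by more than delta.

```python
class PatienceTracker:
    """Hypothetical sketch of patience-based early stopping."""

    def __init__(self, patience=100, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.counter = 0
        self.best_loss = float("inf")
        self.early_stop = False

    def step(self, val_loss):
        """Record one epoch; return True if it improved on the best loss."""
        if val_loss < self.best_loss - self.delta:
            self.best_loss = val_loss
            self.counter = 0  # a real callback would save a checkpoint here
            return True
        self.counter += 1
        if self.counter >= self.patience:
            self.early_stop = True
        return False


tracker = PatienceTracker(patience=3, delta=0.01)
losses = [1.0, 0.8, 0.795, 0.81, 0.80, 0.5]
stopped_at = None
for epoch, loss in enumerate(losses):
    tracker.step(loss)
    if tracker.early_stop:
        stopped_at = epoch
        break
print(stopped_at, tracker.best_loss)
```

In this run the third loss (0.795) is lower than the best so far, but not by more than delta, so it still counts toward the patience limit and training stops before the final value is seen.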

geogenie.utils.callbacks.callback_init(optimizer, args, trial=None, boot=None)[source]

Initialize early stopping and learning rate scheduler callbacks.

EarlyStopping Arguments:

  • output_dir (str): Directory to save the outputs.

  • prefix (str): Prefix for the saved files.

  • patience (int): Number of epochs to wait for improvement before stopping.

  • verbose (bool): If True, prints messages about early stopping.

  • delta (float): Minimum change to qualify as an improvement.

  • trial (optuna.trial.Trial): Optuna trial object.

  • boot (any): Boot object or identifier.

ReduceLROnPlateau Arguments:

  • optimizer (torch.optim.Optimizer): Wrapped optimizer.

  • mode (str): One of ‘min’, ‘max’. In ‘min’ mode, lr will be reduced when the quantity monitored has stopped decreasing; in ‘max’ mode it will be reduced when the quantity monitored has stopped increasing.

  • factor (float): Factor by which the learning rate will be reduced. new_lr = lr * factor.

  • patience (int): Number of epochs with no improvement after which learning rate will be reduced.

  • verbose (bool): If True, prints a message to stdout for each update.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer for which the learning rate scheduler will be applied.

  • args (argparse.Namespace) – Argument namespace containing the necessary hyperparameters and settings.

  • trial (optuna.trial.Trial, optional) – Optuna trial object for hyperparameter optimization. Defaults to None.

  • boot (any, optional) – Boot object or identifier, used for early stopping callback. Defaults to None.

Returns:

A tuple containing the initialized EarlyStopping and ReduceLROnPlateau scheduler.

Return type:

tuple

geogenie.utils.data module

class geogenie.utils.data.CustomDataset(features, labels=None, sample_weights=None, sample_ids=None, dtype=torch.float32)[source]

Bases: Dataset

Class to create a custom PyTorch Dataset with sample weighting and sample IDs.

This class defines a custom PyTorch Dataset that incorporates sample weighting and sample IDs.

tensors

Tuple consisting of (features, labels, sample_weights, sample_ids).

Type:

tuple

property features

Get the features tensor.

property labels

Get the labels tensor.

property n_features

Return the number of columns in the features dataset.

property n_labels

Return the number of columns in the labels dataset.

property sample_ids

Get the sample IDs.

property sample_weights

Get the sample weights.

geogenie.utils.data_structure module

class geogenie.utils.data_structure.DataStructure(vcf_file, verbose=False, dtype=torch.float32, debug=False)[source]

Bases: object

Class to hold data structure from input VCF file.

High level class overview. Key class functionalities include data loading, preprocessing, transformation, and machine learning model preparation.

Initialization and Data Parsing - It loads VCF (Variant Call Format) files using pysam, processes genotypes, and handles missing data.

Data Transformation and Imputation - Implements methods for allele counting, imputing missing genotypes, normalizing data, and transforming genotypes to various encodings.

Data Splitting - Facilitates splitting data into training, validation, and test sets.

Outlier Detection - Includes methods for detecting and handling outliers in the data.

Data Preprocessing and Embedding - The class contains methods for preprocessing data, including scaling, dimensionality reduction, and embedding using various techniques like PCA, t-SNE, MCA, etc.

Data Analysis and Visualization - The script integrates with the geogenie library for tasks like outlier detection and plotting.

Machine Learning and Data Loading - It includes functionalities for creating data loaders (using torch) and preparing datasets for machine learning tasks, with support for different data sampling strategies and weightings.

Utility Methods - The script provides additional utility methods for tasks like reading GTseq data, setting parameters, and selecting optimal components for different data transformations.

In summary, the script is designed for comprehensive genomic data analysis, offering capabilities for data loading, preprocessing, transformation, machine learning model preparation, and visualization. It is structured to handle data from VCF files, process it through various analytical and transformational steps, and prepare it for further analysis or machine learning tasks.

call_create_dataloaders(X, y, args, is_val, sample_ids, dataset=None)[source]

Helper method to create DataLoader objects for different datasets.

Parameters:
  • X (numpy.ndarray | list) – Feature data.

  • y (numpy.ndarray | None) – Target data.

  • args (argparse.Namespace) – User-supplied arguments.

  • is_val (bool) – Whether the dataset is validation or test data. Default is False.

  • sample_ids (List[str]) – List of sample IDs. Default is None.

  • dataset (str) – Dataset type. Required keyword argument. Default is None.

Returns:

DataLoader object.

Return type:

torch.utils.data.DataLoader

Raises:

ValueError – If sample IDs are not provided.

count_alleles()[source]

Count alleles for each SNP across all samples.

Returns:

2D array of allele counts with shape (n_loci, 2).

Return type:

numpy.ndarray
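
The (n_loci, 2) output shape can be illustrated with a small plain-Python sketch. It assumes diploid genotypes already encoded as 0/1/2 alternate-allele dosages and a missing-data sentinel of -9; both the function name count_alleles_012 and the sentinel are hypothetical, and the actual method works on the loaded VCF data rather than on nested lists.

```python
def count_alleles_012(genotypes, missing=-9):
    """Count reference and alternate alleles per locus.

    genotypes: list of samples, each a list of 0/1/2 calls (alt-allele dosage).
    Returns a list of [ref_count, alt_count] pairs, one per locus.
    """
    n_loci = len(genotypes[0])
    counts = [[0, 0] for _ in range(n_loci)]
    for sample in genotypes:
        for locus, call in enumerate(sample):
            if call == missing:
                continue  # missing calls contribute no alleles
            counts[locus][0] += 2 - call  # reference alleles
            counts[locus][1] += call      # alternate alleles
    return counts


gt = [
    [0, 1, 2],
    [0, 2, -9],
    [1, 1, 2],
]
counts = count_alleles_012(gt)
# The minor allele count per locus is the smaller of the two counts.
minor_allele_counts = [min(pair) for pair in counts]
print(counts, minor_allele_counts)
```

A minor-allele-count filter such as the min_mac parameter used elsewhere in this class would then keep only loci whose smaller count meets the threshold.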

create_dataloaders(X, y, batch_size, args, is_val=False, sample_ids=None, dataset=None)[source]

Create dataloaders for training, testing, and validation datasets.

Parameters:
  • X (numpy.ndarray | list) – X dataset. Train, test, or validation.

  • y (numpy.ndarray | None) – Target data (train, test, or validation). None for GNN.

  • batch_size (int) – Batch size to use with model.

  • args (argparse.Namespace) – User-supplied arguments.

  • is_val (bool) – Whether this is a validation or test dataset rather than a training dataset. Default is False.

  • sample_ids (List[str]) – List of sample IDs. Default is None.

  • dataset (str) – Dataset type. Required keyword argument. Default is None.

Returns:

DataLoader object suitable for the specified model type.

Return type:

torch.utils.data.DataLoader

define_params(args)[source]

Defines and updates class attributes from an argparse.Namespace object.

This method sets the class attributes based on the parameters provided in the argparse.Namespace object. It is used to update the class parameters with the values provided in the command-line arguments.

Parameters:

args (argparse.Namespace) – Argument namespace containing the parameters.

embed(args, X=None, alg='pca', full_dataset_only=False, transform_only=False)[source]

Embed SNP data using one of several dimensionality reduction techniques.

Parameters:
  • args (argparse.Namespace) – User-supplied arguments.

  • X (numpy.ndarray) – Data to embed. If None, uses self.genotypes_enc. Default is None.

  • alg (str) – Algorithm to use. Default is ‘pca’.

  • full_dataset_only (bool) – If True, only embed and return full dataset. Default is False.

  • transform_only (bool) – If True, only transform without fitting. Default is False.

Returns:

Embedded data.

Return type:

numpy.ndarray

Raises:

EmbeddingError – If the optimal number of components cannot be estimated.

extract_datasets(outliers, args)[source]

Extracts and separates datasets into known and predicted sets based on the presence of missing data.

Parameters:
  • outliers (numpy.ndarray) – Array of outlier indices.

  • args (argparse.Namespace) – User-supplied arguments.

Returns:

Extracted datasets and sample indices.

Return type:

tuple

filter_gt(gt, min_mac, max_snps, allele_counts)[source]

Filter genotypes based on minor allele count and random subsets (max_snps).

Parameters:
  • gt (numpy.ndarray) – Genotypes to filter.

  • min_mac (int) – Minimum minor allele count.

  • max_snps (int, optional) – Maximum number of SNPs to retain.

  • allele_counts (numpy.ndarray) – Allele counts.

Returns:

Filtered genotypes.

Return type:

numpy.ndarray

find_optimal_nmf_components(data, min_components, max_components)[source]

Find the optimal number of components for NMF based on reconstruction error.

Parameters:
  • data (np.array) – The data to fit the NMF model.

  • min_components (int) – The minimum number of components to try.

  • max_components (int) – The maximum number of components to try.

Returns:

The optimal number of components and the reconstruction errors.

Return type:

tuple

generate_unknowns(p=0.1, seed=None, verbose=False)[source]

Randomly choose unknown samples for prediction.

Only used if the user does not supply any unknowns.

Parameters:
  • p (float) – Proportion of samples to randomly select for the unknown prediction dataset. Defaults to 0.1.

  • seed (int or None) – Random seed to use for the random choice generator. Defaults to None (no random seed supplied).

  • verbose (bool) – Whether in verbose mode. Defaults to False.

get_num_pca_comp(x)[source]

Get optimal number of PCA components.

Parameters:

x (numpy.ndarray) – Dataset to fit PCA to.

Returns:

Optimal number of principal components to use.

Return type:

int

get_sample_weights(y, args)[source]

Gets inverse sample_weights based on sampling density.

Uses scikit-learn KernelDensity with a grid search to estimate the optimal bandwidth for GeographicDensitySampler.

Parameters:
  • y (numpy.ndarray) – Target values.

  • args (argparse.Namespace) – User-supplied arguments.

Returns:

The weighted sampler with estimated sample weights.

Return type:

GeographicDensitySampler

impute_missing(X, transform_only=False)[source]

Impute missing genotypes based on allele frequency threshold.

Parameters:
  • X (numpy.ndarray) – Data to impute.

  • transform_only (bool) – Whether to transform, but not fit. Default is False.

Returns:

Imputed data.

Return type:

numpy.ndarray

is_biallelic(record)[source]

Check whether a VCF record is biallelic.

Parameters:

record (pysam.VariantRecord) – A VCF record.

Returns:

True if the record is biallelic, False otherwise.

Return type:

bool

load_and_preprocess_data(args)[source]

Wrapper method to load and preprocess data.

Code execution order is listed below.

Sample Data Loading and Sorting - Calls self.sort_samples with args.sample_data to load and sort sample data. This step involves reading sample data, presumably including their geographical locations, and aligning them with the genomic data.

SNP Encoding Transformation - Transforms Single Nucleotide Polymorphisms (SNPs) into a 0,1,2 encoding format using self.snps_to_012, considering parameters like min_mac (minimum minor allele count) and max_snps (maximum number of SNPs).

Validation of Feature and Target Lengths - Ensures that the feature data (X) and target data (y) have the same number of rows using self.validate_feature_target_len.

Missing Data Imputation on full dataset - Imputes missing data in self.genotypes_enc using self.impute_missing.

Data Embedding and Transformation - Performs an embedding transformation (like PCA) on the imputed data (X) using self.embed, with full_dataset_only=True and transform_only=False.

Index Mask Setup - Defines masks for prediction (self.pred_mask) to identify samples with known and unknown locations. Sets up index masks for data using self.setup_index_masks.

Outlier Detection (Conditional) - If args.detect_outliers is True, performs outlier detection using self.run_outlier_detection.

Extract datasets - self.X, self.y, self.X_pred, self.true_idx, self.all_samples, self.samples, self.pred_samples, self.outlier_samples, self.non_outlier_samples = self.extract_datasets(all_outliers, args)

Plotting Outliers (Conditional) - If outliers are detected, plots the outliers using self.plotting.plot_outliers.

Data Normalization (Placeholder) - Normalizes the target data (y) using self.normalize_target with placeholder=True. This does nothing, unless placeholder=False.

Splitting into Train, Test, and Validation Sets - Splits the dataset into training, validation, and testing sets using self.split_train_test.

Feature Embedding for Train, Validation, and Test Sets (Conditional) - If args.embedding_type is not none, embeds the features of the training, validation, and test sets using self.embed.

Logging Data Split and DataLoader Creation - Logs the completion of data splitting and the start of DataLoader creation, if verbosity is enabled.

DataLoader Creation - Creates DataLoaders for training, validation, and test datasets using self.call_create_dataloaders. An additional DataLoader is created for gradient boosting if args.use_gradient_boosting is True.

Logging Completion of Preprocessing - Logs the successful completion of data loading and preprocessing, if verbosity is enabled.

map_alleles_to_iupac(alleles, iupac_dict)[source]

Maps a list of allele tuples to their corresponding IUPAC nucleotide codes.

Parameters:
  • alleles (List[tuple]) – List of tuples representing alleles.

  • iupac_dict (dict) – Dictionary mapping allele tuples to IUPAC codes.

Returns:

List of IUPAC nucleotide codes corresponding to the alleles.

Return type:

List[str]

map_outliers_through_filters(original_indices, filter_stages, outliers)[source]

Maps outlier indices through multiple filtering stages back to the original dataset.

Parameters:
  • original_indices (np.ndarray) – Array of original indices before any filtering.

  • filter_stages (List[np.ndarray]) – List of arrays of indices after each filtering stage.

  • outliers (np.ndarray) – Outlier indices in the most filtered dataset.

Returns:

Mapped outlier indices in the original dataset.

Return type:

np.ndarray
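
One way to picture the index mapping is the sketch below. It assumes that each entry of filter_stages lists the positions retained from the previous stage (with the first stage indexing into original_indices); that layout is an assumption made for illustration, and the function name map_to_original is hypothetical.

```python
def map_to_original(original_indices, filter_stages, outliers):
    """Translate positions in the most-filtered data back to original indices.

    Assumes each entry of filter_stages lists the positions kept from the
    previous stage (the first stage indexes into original_indices).
    """
    mapped = list(outliers)
    # Walk the stages backwards, undoing one level of filtering at a time.
    for kept in reversed(filter_stages):
        mapped = [kept[i] for i in mapped]
    return [original_indices[i] for i in mapped]


original = [10, 11, 12, 13, 14, 15]
stages = [
    [0, 2, 3, 5],  # stage 1 keeps original positions 0, 2, 3, 5
    [1, 3],        # stage 2 keeps positions 1 and 3 of stage 1's output
]
result = map_to_original(original, stages, outliers=[1])
print(result)
```

Here the outlier at position 1 of the final dataset is position 3 of the stage 1 output, which in turn is position 5 of the original data.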

normalize_target(y, transform_only=False)[source]

Normalize locations, ignoring NaN.

Parameters:
  • y (numpy.ndarray) – Array of locations to normalize.

  • transform_only (bool) – Whether to transform without fitting. Default is False.

Returns:

Normalized locations.

Return type:

numpy.ndarray

property params

Getter for the params dictionary.

Returns:

Parameters dictionary.

Return type:

dict

perform_mca_and_select_components(data, n_components_range, S)[source]

Perform MCA on the provided data and select the optimal number of components.

Parameters:
  • data (pd.DataFrame) – The categorical data.

  • n_components_range (list) – The range of components to explore.

  • S (float) – Sensitivity setting for selecting optimal number of components.

Returns:

The optimal number of components.

Return type:

int

run_outlier_detection(args, X, indices, y, index, filter_stage_indices)[source]

Performs outlier detection using geographic and genetic criteria.

Parameters:
  • args (argparse.Namespace) – User-supplied arguments.

  • X (numpy.ndarray) – Feature matrix.

  • indices (numpy.ndarray) – Indices of samples.

  • y (numpy.ndarray) – Target variable (coordinates).

  • index (numpy.ndarray) – Index array for samples.

  • filter_stage_indices (List[np.ndarray]) – List of filter stage indices.

Returns:

Array of outlier indices.

Return type:

numpy.ndarray

Raises:

OutlierDetectionError – If an error occurs during outlier detection.

select_optimal_components(cumulative_inertia, S)[source]

Select the optimal number of components based on explained inertia.

Parameters:
  • cumulative_inertia (list) – The cumulative inertia for each component.

  • S (float) – Sensitivity setting for selecting optimal number of components.

Returns:

The optimal number of components.

Return type:

int

setup_index_masks(X)[source]

Sets up index masks for filtering the data.

Parameters:

X (numpy.ndarray) – Feature data.

Returns:

Filtered feature data, indices, target data, and sample indices.

Return type:

tuple

snps_to_012(min_mac=2, max_snps=None, return_values=True)[source]

Convert IUPAC SNPs to 012 encodings.

Parameters:
  • min_mac (int) – Minimum minor allele count. Default is 2.

  • max_snps (int, optional) – Maximum number of SNPs to retain.

  • return_values (bool) – Whether to return encoded values. Default is True.

Returns:

Encoded genotypes if return_values is True, otherwise updates internal state.

Return type:

numpy.ndarray
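
The 0/1/2 encoding itself is simple to illustrate: each diploid genotype is replaced by its count of non-reference alleles. The sketch below assumes genotypes given as allele tuples and a missing-data sentinel of -9; the function name encode_012 and the sentinel are hypothetical and not taken from geogenie.

```python
def encode_012(genotype, ref, missing=-9):
    """Encode one diploid genotype as the number of non-reference alleles."""
    if genotype is None or None in genotype:
        return missing
    return sum(1 for allele in genotype if allele != ref)


calls = [("A", "A"), ("A", "G"), ("G", "G"), (None, None)]
encoded = [encode_012(g, ref="A") for g in calls]
print(encoded)
```

Homozygous reference becomes 0, heterozygous becomes 1, and homozygous alternate becomes 2, which is why the encoding only makes sense for biallelic sites.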

sort_samples(sample_data_filename)[source]

Load sample_data and popmap and sort to match VCF file.

Parameters:

sample_data_filename (str) – Filename of the sample data file.

Raises:

InvalidSampleDataError – If the sample data file format is incorrect.

split_train_test(train_split, val_split, seed, args)[source]

Splits the data into training, validation, and test datasets.

Parameters:
  • train_split (float) – Proportion of the data to use for training.

  • val_split (float) – Proportion of the data to use for validation.

  • seed (int) – Random seed for reproducibility.

  • args (argparse.Namespace) – Argument namespace containing additional parameters.

validate_feature_target_len()[source]

Validate that the feature and target datasets have the same length.

Raises:

InvalidInputShapeError – If the shapes of the feature and target datasets do not match.

geogenie.utils.exceptions module

exception geogenie.utils.exceptions.DataStructureError[source]

Bases: Exception

Base class for exceptions in DataStructure.

exception geogenie.utils.exceptions.EmbeddingError(message='n_components could not be estimated for embedding.')[source]

Bases: DataStructureError

Exception raised for errors during embedding.

exception geogenie.utils.exceptions.GPUUnavailableError(message='Specified GPU is not available.')[source]

Bases: Exception

Exception raised when a GPU is specified but not available.

exception geogenie.utils.exceptions.InvalidInputShapeError(shape1, shape2)[source]

Bases: DataStructureError

Exception raised for invalid input shapes.

exception geogenie.utils.exceptions.InvalidSampleDataError(message='Invalid sample data format. Expected a tab-delimited file with three columns: sampleID, x, and y.')[source]

Bases: DataStructureError

Exception raised for errors in the sample data file.

exception geogenie.utils.exceptions.OutlierDetectionError(message='Error occurred during outlier detection.')[source]

Bases: DataStructureError

Exception raised for errors during outlier detection.

exception geogenie.utils.exceptions.ResourceAllocationError(message='Specified resource is not available.')[source]

Bases: Exception

Exception raised when a specified resource is invalid.

exception geogenie.utils.exceptions.SampleOrderingError(message='Invalid sample ordering after filtering and sorting.')[source]

Bases: DataStructureError

Exception raised for errors in the sample ordering.

exception geogenie.utils.exceptions.TimeoutException[source]

Bases: Exception
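
The practical consequence of this hierarchy is that callers can catch DataStructureError to handle every data-related failure at once, while resource errors such as GPUUnavailableError fall through. The sketch below redefines three of the classes locally as stand-ins so that it runs without geogenie installed; only the names, base classes and default messages come from the listing above, and the classify helper is hypothetical.

```python
class DataStructureError(Exception):
    """Local stand-in for the documented base class."""


class EmbeddingError(DataStructureError):
    """Local stand-in; default message copied from the signature above."""

    def __init__(self, message="n_components could not be estimated for embedding."):
        super().__init__(message)


class GPUUnavailableError(Exception):
    """Local stand-in; derives from Exception, not DataStructureError."""

    def __init__(self, message="Specified GPU is not available."):
        super().__init__(message)


def classify(exc):
    """Return a label showing which handler catches the exception."""
    try:
        raise exc
    except DataStructureError:
        return "data"
    except Exception:
        return "other"


print(classify(EmbeddingError()), classify(GPUUnavailableError()))
```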

geogenie.utils.logger module

geogenie.utils.logger.setup_logger(log_file, log_level=20)[source]

Function to set up a logger for logging info, warnings, errors, and debug prints.

Parameters:
  • log_file (str) – Filename for log file.

  • log_level (int) – Log level to use. Should be one of logging.INFO, logging.WARNING, logging.ERROR, or logging.DEBUG. These constants are integers (logging.INFO is 20). Defaults to logging.INFO.
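
A logger of this kind can be sketched with the standard logging module. The helper below, make_logger, is hypothetical and only assumes that messages go to both a file and the console; the handlers and format used by the real setup_logger may differ.

```python
import logging
import os
import tempfile


def make_logger(log_file, log_level=logging.INFO, name="geogenie_sketch"):
    """Hypothetical helper: log to both a file and the console."""
    logger = logging.getLogger(name)
    logger.setLevel(log_level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    fmt = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    for handler in (logging.FileHandler(log_file), logging.StreamHandler()):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "run.log")
logger = make_logger(log_path, log_level=logging.INFO)
logger.debug("hidden at INFO level")
logger.info("visible at INFO level")
for handler in logger.handlers:
    handler.flush()
with open(log_path) as fh:
    contents = fh.read()
```

The default log_level=20 in the signature above is the integer value of logging.INFO, so debug messages are dropped unless a lower level is requested.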

geogenie.utils.loss module

class geogenie.utils.loss.WeightedDRMSLoss(radius=6371)[source]

Bases: Module

Custom loss class to compute the Distance Root Mean Square (DRMS) for longitude and latitude coordinates.

radius

Radius of the Earth in kilometers. Default is 6371 km.

Type:

float

forward(preds, targets, sample_weight=None)[source]

Forward pass to compute the Distance Root Mean Square (DRMS) loss.

Parameters:
  • preds (torch.Tensor) – Predicted longitude and latitude coordinates.

  • targets (torch.Tensor) – Actual longitude and latitude coordinates.

  • sample_weight (torch.Tensor) – Sample weights to make some samples more or less important than others. Defaults to None.

Returns:

DRMS loss.

Return type:

torch.Tensor

class geogenie.utils.loss.WeightedHuberLoss(delta=1.0, smoothing_factor=0.1)[source]

Bases: Module

Custom loss class to compute the Weighted Huber Loss.

delta

The threshold for the Huber loss. Default is 1.0.

Type:

float

smoothing_factor

The smoothing factor for the target. Default is 0.1.

Type:

float

forward(input, target, sample_weight=None)[source]

Forward pass to compute the Weighted Huber Loss.

geogenie.utils.loss.weighted_rmse_loss(y_true, y_pred, sample_weight=None)[source]

Custom PyTorch weighted RMSE loss function.

This method computes the weighted RMSE loss between the ground truth and predictions.

Parameters:
  • y_true (torch.Tensor) – Ground truth values.

  • y_pred (torch.Tensor) – Predictions.

  • sample_weight (torch.Tensor) – Sample weights (1-dimensional).

Returns:

Weighted RMSE loss.

Return type:

float
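
A weighted RMSE can be written out in plain Python as below. The sketch assumes that the weighted squared errors are normalized by the sum of the weights; the PyTorch implementation documented here may normalize differently, and the function name weighted_rmse is hypothetical.

```python
import math


def weighted_rmse(y_true, y_pred, sample_weight=None):
    """Weighted RMSE over paired scalar values (plain-Python sketch)."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    squared = [w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, sample_weight)]
    return math.sqrt(sum(squared) / sum(sample_weight))


unweighted = weighted_rmse([0.0, 0.0], [3.0, 4.0])
# Up-weighting the sample with the smaller error lowers the overall loss.
weighted = weighted_rmse([0.0, 0.0], [3.0, 4.0], sample_weight=[3.0, 1.0])
print(unweighted, weighted)
```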

geogenie.utils.scorers module

class geogenie.utils.scorers.LocallyLinearEmbeddingWrapper(*, n_neighbors=5, n_components=2, reg=0.001, eigen_solver='auto', tol=1e-06, max_iter=100, method='standard', hessian_tol=0.0001, modified_tol=1e-12, neighbors_algorithm='auto', random_state=None, n_jobs=None)[source]

Bases: LocallyLinearEmbedding

Wrapper class for LocallyLinearEmbedding to allow for prediction.

static lle_reconstruction_scorer(estimator, X, y=None)[source]

Compute the negative reconstruction error for an LLE model to use as a scorer. GridSearchCV assumes that higher score values are better, so the reconstruction error is negated.

Parameters:
  • estimator (LocallyLinearEmbedding) – Fitted LLE model.

  • X (numpy.ndarray) – Original high-dimensional data.

Returns:

Negative reconstruction error.

Return type:

float

predict(X)[source]
geogenie.utils.scorers.calculate_r2_knn(predicted_data, actual_data)[source]

Calculate the coefficient of determination (R^2) for predictions.

Parameters:
  • predicted_data (np.array) – Predicted data from KNN.

  • actual_data (np.array) – Actual data.

Returns:

R^2 value.

Return type:

float

geogenie.utils.scorers.calculate_rmse(preds, targets)[source]
geogenie.utils.scorers.haversine_distance(coord1, coord2)[source]

Calculate the Haversine distance between two geographic coordinate points.

Parameters:
  • coord1 (tuple) – Latitude and longitude of the first point.

  • coord2 (tuple) – Latitude and longitude of the second point.

Returns:

Distance in kilometers.

Return type:

float
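
For reference, the Haversine formula can be sketched with the standard math module. The sketch assumes (latitude, longitude) tuples in decimal degrees and an Earth radius of 6371 km, matching the default radius documented for WeightedDRMSLoss; the function name haversine_km is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(coord1, coord2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, coord1)
    lat2, lon2 = map(math.radians, coord2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111.2 km.
one_degree = haversine_km((0.0, 0.0), (0.0, 1.0))
print(round(one_degree, 2))
```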

geogenie.utils.scorers.kstest(y_true, y_pred, sample_weight=None)[source]

Perform the Kolmogorov-Smirnov test on the Haversine errors.

geogenie.utils.spatial_data_processors module

class geogenie.utils.spatial_data_processors.SpatialDataProcessor(output_dir=None, basemap_fips=None, crs='EPSG:4326', logger=None)[source]

Bases: object

Spatial data processing class.

This class provides a set of tools for processing spatial data, including calculating statistics, distances, and clustering.

tmpdir

Temporary directory for shapefiles.

Type:

Path

output_dir

Output directory for shapefiles.

Type:

Path

basemap_fips

FIPS code for the base map.

Type:

str

crs

Coordinate reference system.

Type:

str

logger

Logger object.

Type:

logging.Logger

_ensure_is_gdf(gdf)[source]

Ensure the data is a GeoPandas GeoDataFrame.

Parameters:

gdf (geopandas.GeoDataFrame) – GeoDataFrame to validate.

Returns:

Validated GeoDataFrame.

Return type:

geopandas.GeoDataFrame

Raises:

TypeError – If the data is not a valid format.

_ensure_is_numpy(X)[source]

Ensure the data is a NumPy array.

Parameters:

X (np.ndarray) – Data to validate.

Returns:

Validated NumPy array.

Return type:

np.ndarray

Raises:

TypeError – If the data is not a valid format.

_ensure_is_pandas(df)[source]

Ensure the data is a Pandas DataFrame.

Parameters:

df (pandas.DataFrame) – Data to validate.

Returns:

Validated DataFrame.

Return type:

pandas.DataFrame

Raises:

TypeError – If the data is not a valid format.

_validate_dists(err)[source]
calculate_bounding_box(gdf)[source]

Calculate the bounding box for the dataset.

calculate_convex_hull(gdf)[source]

Calculate the convex hull of all points.

calculate_statistics(gdf, max_boots=None, seed=None, known_coords=None)[source]

Calculate statistics on properly projected data.

detect_clusters(gdf, eps=0.5, min_samples=5)[source]

Detect clusters using DBSCAN.

detect_outliers(gdf, threshold=2)[source]

Identify outliers based on distance from the mean.

extract_basemap_path_url(url)[source]

Extract base map from provided URL or file path.

Parameters:

url (str) – URL or file path to extract base map from.

Returns:

Extracted base map path or URL.

Return type:

str

Raises:

ValueError – If the base map FIPS code is not provided.

geodesic_distance(coords1, coords2)[source]

Calculate geodesic distance between two sets of coordinates.

haversine_distance(coords1, coords2)[source]

Calculate haversine distance between two sets of points.

Parameters:
  • coords1 (np.ndarray) – First set of coordinates.

  • coords2 (np.ndarray) – Second set of coordinates.

Returns:

Haversine distance between coords1 and coords2.

Return type:

np.ndarray

haversine_error(point1, point2)[source]

Calculate the haversine error between two geographic points.

Parameters:
  • point1 (float) – First point to calculate error from.

  • point2 (float) – Second point to calculate error from.

Returns:

Haversine error between points 1 and 2.

Return type:

np.ndarray

nearest_neighbor(gdf)[source]

Calculate the nearest neighbor for each point.

spherical_mean(gdf)[source]

Calculate spherical mean of geographic points.

to_geopandas(df)[source]

Convert DataFrame to GeoPandas GeoDataFrame with proper CRS.

Parameters:

df (pandas.DataFrame) – DataFrame to convert to geopandas.GeoDataFrame.

Returns:

Converted GeoDataFrame object.

Return type:

geopandas.GeoDataFrame

to_numpy(gdf)[source]

Convert GeoPandas GeoDataFrame to NumPy array.

Parameters:

gdf (geopandas.GeoDataFrame) – GeoDataFrame to convert to numpy.

Returns:

Converted numpy array.

Return type:

numpy.ndarray

to_pandas(gdf)[source]

Convert GeoPandas GeoDataFrame to Pandas DataFrame.

Parameters:

gdf (geopandas.GeoDataFrame) – GeoDataFrame to convert to pandas.DataFrame.

Returns:

pandas DataFrame object.

Return type:

pandas.DataFrame

geogenie.utils.transformers module

class geogenie.utils.transformers.MCA(n_components=2, n_iter=10, check_input=True, random_state=None, one_hot=True, categories=[0, 1, 2], epsilon=1e-05)[source]

Bases: BaseEstimator, TransformerMixin

Class to perform Multiple Correspondence Analysis (MCA).

This class performs Multiple Correspondence Analysis (MCA) on the input data.

n_components

Number of MCA components to output.

Type:

int

n_iter

Number of randomized SVD iterations to perform.

Type:

int

check_input

Whether to check input data for conformity.

Type:

bool

random_state

Random state for reproducibility.

Type:

int or None

one_hot

Flag for one-hot encoding the input data.

Type:

bool

categories

Possible categories in input features.

Type:

list

epsilon

Small value to prevent division by 0.

Type:

float

logger

Logger object for the class.

Type:

logging.Logger

_compute_S_matrix(X)[source]

Compute the S matrix.

Parameters:

X (np.ndarray) – The input data.

Returns:

The S matrix.

Return type:

sp.spmatrix

_normalize_data(X)[source]

Normalize the input data.

Parameters:

X (np.ndarray) – Array to normalize.

Returns:

Normalized array.

Return type:

np.ndarray

_store_results()[source]

Store the results of the MCA.

fit(X, y=None)[source]

Fit the input data.

Parameters:
  • X (np.ndarray) – Array to fit.

  • y (None, optional) – Ignored. This parameter exists only for compatibility with the sklearn API.

transform(X)[source]

Transform input data X using MCA.

Parameters:

X (np.ndarray) – Array to transform.

class geogenie.utils.transformers.MinMaxScalerGeo(lat_range=(-90, 90), lon_range=(-180, 180), scale_min=0, scale_max=1)[source]

Bases: BaseEstimator, TransformerMixin

Class to scale geographic coordinates to a specified range.

lat_range

Minimum and maximum values for latitude.

Type:

tuple

lon_range

Minimum and maximum values for longitude.

Type:

tuple

scale_min

Minimum value of the scaled range.

Type:

float

scale_max

Maximum value of the scaled range.

Type:

float

logger

Logger object for the class.

Type:

logging.Logger

_sklearn_auto_wrap_output_keys = {'transform'}

fit(X, y=None)[source]

Fit does nothing as parameters are not data-dependent.

Parameters:
  • X (array-like) – The data to fit. Ignored. This parameter exists only for compatibility with the sklearn API.

  • y (None, optional) – Ignored. This parameter exists only for compatibility with the sklearn API.

Returns:

Returns the instance itself.

Return type:

self

inverse_transform(X_scaled)[source]

Scale back the coordinates to their original range.

Parameters:

X_scaled (array-like) – The scaled coordinates to revert.

Returns:

Original geographic coordinates.

Return type:

np.array

set_inverse_transform_request(*, X_scaled: bool | None | str = '$UNCHANGED$') MinMaxScalerGeo

Request metadata passed to the inverse_transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

X_scaled (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_scaled parameter in inverse_transform.

Returns:

self – The updated object.

Return type:

object

transform(X)[source]

Scale the geographic coordinates based on the provided ranges.

Parameters:

X (array-like) – The input coordinates to transform. Expected shape (n_samples, 2) where X[:, 0] should be longitude and X[:, 1] should be latitude.

Returns:

Transformed coordinates, where each feature is scaled to [scale_min, scale_max].

Return type:

np.array
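
The scaling described for transform and inverse_transform can be illustrated with a minimal, dependency-free sketch. The function names scale_coords and unscale_coords are hypothetical and are not part of geogenie; the sketch assumes plain min-max scaling with the default ranges shown in the class signature and uses lists of (longitude, latitude) tuples instead of NumPy arrays.

```python
def scale_coords(coords, lat_range=(-90, 90), lon_range=(-180, 180),
                 scale_min=0.0, scale_max=1.0):
    """Min-max scale (longitude, latitude) pairs into [scale_min, scale_max].

    Hypothetical helper; an assumption about how such scaling can be done,
    not the geogenie implementation.
    """
    span = scale_max - scale_min
    scaled = []
    for lon, lat in coords:
        lon_s = (lon - lon_range[0]) / (lon_range[1] - lon_range[0]) * span + scale_min
        lat_s = (lat - lat_range[0]) / (lat_range[1] - lat_range[0]) * span + scale_min
        scaled.append((lon_s, lat_s))
    return scaled


def unscale_coords(scaled, lat_range=(-90, 90), lon_range=(-180, 180),
                   scale_min=0.0, scale_max=1.0):
    """Invert scale_coords, mapping scaled pairs back to degrees."""
    span = scale_max - scale_min
    coords = []
    for lon_s, lat_s in scaled:
        lon = (lon_s - scale_min) / span * (lon_range[1] - lon_range[0]) + lon_range[0]
        lat = (lat_s - scale_min) / span * (lat_range[1] - lat_range[0]) + lat_range[0]
        coords.append((lon, lat))
    return coords


# Column order follows the documented convention: longitude first, then latitude.
points = [(-180, -90), (0, 0), (180, 90)]
scaled_points = scale_coords(points)
restored_points = unscale_coords(scaled_points)
```

Because the ranges are fixed constants rather than estimated from the data, the same mapping applies to training and test coordinates, which is consistent with fit being documented as a no-op.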

geogenie.utils.utils module

geogenie.utils.utils.assign_to_bins(df, long_col, lat_col, n_bins, args, method='optics', min_samples=None, random_state=None)[source]

Assign longitude and latitude coordinates to bins based on K-Means or OPTICS clustering.

Parameters:
  • df (pd.DataFrame) – DataFrame containing longitude and latitude columns.

  • long_col (str) – Name of the column containing longitude values.

  • lat_col (str) – Name of the column containing latitude values.

  • n_bins (int) – Number of bins (clusters) for KMeans or minimum samples for OPTICS.

  • method (str) – Clustering method (‘kmeans’ or ‘optics’). Defaults to ‘optics’.

  • min_samples (int) – Minimum number of samples for OPTICS. Defaults to None, in which case 4 is used.

  • random_state (int or RandomState) – Random seed or state for reproducibility. Defaults to None (new seed each time).

Returns:

Numpy array with bin indices indicating the bin assignment.

Return type:

np.ndarray

geogenie.utils.utils.check_column_dtype(df, column_name)[source]

Check if a DataFrame column is string dtype or numeric dtype.

Parameters:
  • df (pd.DataFrame) – The DataFrame to check.

  • column_name (str) – The name of the column to check.

Returns:

‘string’ if the column is of string dtype, ‘numeric’ if the column is of numeric dtype, otherwise ‘other’.

Return type:

str

geogenie.utils.utils.detect_separator(file_path)[source]

Detects the separator used in a CSV file by reading the first line.

Parameters:

file_path (str) – Path to the CSV file.

Returns:

The detected separator (one of ‘,’, a tab, or a space).

Return type:

str

geogenie.utils.utils.geo_coords_is_valid(coordinates)[source]

Validates that a given NumPy array contains valid geographic coordinates.

Parameters:

coordinates (np.ndarray) – A NumPy array of shape (n_samples, 2) where the first column is longitude and the second is latitude.

Raises:

ValueError – If the array shape is not (n_samples, 2), or if the longitude and latitude values are not in their respective valid ranges.

Returns:

True if the validation passes, indicating the coordinates are valid.

Return type:

bool
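
The checks described above can be sketched without NumPy as follows. The name coords_are_valid is hypothetical, and the bounds of [-180, 180] for longitude and [-90, 90] for latitude are assumptions based on the default ranges of MinMaxScalerGeo; the actual function may apply different limits or messages.

```python
def coords_are_valid(coordinates):
    """Validate a sequence of (longitude, latitude) pairs.

    Hypothetical stand-in for illustration; raises ValueError on the
    first problem found and returns True otherwise.
    """
    for row in coordinates:
        if len(row) != 2:
            raise ValueError("Expected rows of the form (longitude, latitude).")
        lon, lat = row
        if not -180 <= lon <= 180:
            raise ValueError(f"Longitude out of range: {lon}")
        if not -90 <= lat <= 90:
            raise ValueError(f"Latitude out of range: {lat}")
    return True


ok = coords_are_valid([(-93.2, 36.1), (120.5, -33.9)])

try:
    # Latitude and longitude swapped: 120.5 is not a valid latitude.
    coords_are_valid([(-33.9, 120.5)])
    swapped_rejected = False
except ValueError:
    swapped_rejected = True
```

A range check of this kind only catches swapped columns when a longitude falls outside [-90, 90], so it does not guarantee that the column order is correct.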

geogenie.utils.utils.get_iupac_dict()[source]

geogenie.utils.utils.read_csv_with_dynamic_sep(file_path)[source]

Reads a CSV file with a dynamically detected separator, replacing spaces with tabs in-memory.

Parameters:

file_path (str) – Path to the CSV file.

Returns:

The DataFrame containing the CSV data.

Return type:

pd.DataFrame

geogenie.utils.utils.time_limit(seconds)[source]

Context manager that limits the execution time of the code within the context. If execution takes longer than the given number of seconds, the remaining code within the context is skipped.
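
A common way to build such a context manager on Unix is with an interval timer and SIGALRM. The sketch below assumes that approach; it is not necessarily how geogenie implements time_limit. The names time_limit_sketch and TimeLimitExceeded are hypothetical, and the code works only on Unix-like systems and in the main thread.

```python
import signal
import time
from contextlib import contextmanager


class TimeLimitExceeded(Exception):
    """Raised internally when the timer fires (hypothetical name)."""


@contextmanager
def time_limit_sketch(seconds):
    """Skip the rest of the with-block once `seconds` have elapsed."""

    def _on_alarm(signum, frame):
        raise TimeLimitExceeded()

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    except TimeLimitExceeded:
        # Swallow the timeout so execution resumes after the with-block.
        pass
    finally:
        # Cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


finished = False
with time_limit_sketch(0.2):
    time.sleep(2)      # interrupted after roughly 0.2 seconds
    finished = True    # not reached when the limit is exceeded
```

Swallowing the exception inside the context manager is what makes the block appear "skipped"; a variant that re-raises would let the caller distinguish a timeout from normal completion.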

geogenie.utils.utils.validate_is_numpy(features, labels, sample_weights)[source]

Ensure that features, labels, and sample_weights are numpy arrays and not PyTorch Tensors.

Module contents