geogenie.utils package
Submodules
geogenie.utils.argument_parser module
- class geogenie.utils.argument_parser.EvaluateAction(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)[source]
Bases:
Action
Custom action for evaluating complex arguments as Python literal structures.
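As a rough illustration of how such an action might behave, the sketch below parses an option's raw string with ast.literal_eval. The class name LiteralEvalAction and the --widths flag are hypothetical and chosen only for this example; the package's own implementation may differ.

```python
import argparse
import ast


class LiteralEvalAction(argparse.Action):
    """Hypothetical stand-in: parse the raw string as a Python literal."""

    def __call__(self, parser, namespace, values, option_string=None):
        try:
            parsed = ast.literal_eval(values)
        except (ValueError, SyntaxError) as exc:
            # parser.error prints a usage message and exits.
            parser.error(f"{option_string}: could not evaluate {values!r}: {exc}")
        setattr(namespace, self.dest, parsed)


parser = argparse.ArgumentParser()
# --widths is a made-up flag used only for this example.
parser.add_argument("--widths", action=LiteralEvalAction, default=[])
args = parser.parse_args(["--widths", "[256, 128, 64]"])
print(args.widths)
```

ast.literal_eval only accepts literals (lists, dicts, numbers, strings, and so on), so arbitrary expressions on the command line are rejected rather than executed.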
- geogenie.utils.argument_parser.load_config(config_path)[source]
Load the YAML configuration file.
- Parameters:
config_path (str) – Path to configuration file.
- Returns:
Configuration arguments.
- Return type:
dict
- geogenie.utils.argument_parser.setup_parser(test_mode=False)[source]
Parse command-line arguments.
- Returns:
Parsed command line arguments.
- Return type:
argparse.Namespace
- geogenie.utils.argument_parser.validate_gpu_number(value)[source]
Validate the provided GPU number.
- Parameters:
value (str) – The GPU number provided as a command-line argument.
- Returns:
The validated GPU number.
- Return type:
int
- Raises:
argparse.ArgumentTypeError – If the GPU number is invalid.
- geogenie.utils.argument_parser.validate_n_jobs(value)[source]
Validate the provided n_jobs parameter.
- Parameters:
value (int) – the number of jobs to use.
- Returns:
The validated n_jobs parameter.
- Return type:
int
- geogenie.utils.argument_parser.validate_positive_float(value)[source]
Validate that the provided value is a positive float.
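Validators of this kind are typically passed to argparse as the type callable. A minimal sketch, assuming that "positive" means strictly greater than zero; the function name positive_float and the --learning_rate flag are hypothetical:

```python
import argparse


def positive_float(value):
    """Hypothetical argparse `type` callable: accept only floats > 0."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        raise argparse.ArgumentTypeError(f"{value!r} is not a number")
    if number <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} is not a positive float")
    return number


parser = argparse.ArgumentParser()
# --learning_rate is a made-up flag used only for this example.
parser.add_argument("--learning_rate", type=positive_float, default=1e-3)
args = parser.parse_args(["--learning_rate", "0.01"])
print(args.learning_rate)
```

Raising argparse.ArgumentTypeError lets argparse report the problem as a normal usage error instead of a traceback.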
geogenie.utils.callbacks module
- class geogenie.utils.callbacks.EarlyStopping(output_dir, prefix, patience=100, verbose=0, delta=0, trial=None, boot=None)[source]
Bases:
object
Early stopping PyTorch callback.
This class defines an early stopping callback for PyTorch models.
- patience
Number of epochs to wait for improvement before stopping.
- Type:
int
- verbose
Verbosity mode.
- Type:
int
- delta
Minimum change to qualify as an improvement.
- Type:
float
- output_dir
Directory to save the outputs.
- Type:
str
- prefix
Prefix for the saved files.
- Type:
str
- counter
Counter for the number of epochs since last improvement.
- Type:
int
- early_stop
Flag to indicate if early stopping is triggered.
- Type:
bool
- best_score
Best score for the monitored quantity.
- Type:
float
- val_loss_min
Minimum validation loss.
- Type:
float
- boot
Boot object or identifier.
- Type:
int
- trial
Trial number for hyperparameter optimization.
- Type:
int
- logger
Logger object for the class.
- Type:
logging.Logger
- load_best_model(model)[source]
Load the best model from the checkpoint file.
- Parameters:
model (torch.nn.Module) – The model to load the checkpoint into.
- Returns:
The model with weights loaded from the best checkpoint.
- Return type:
torch.nn.Module
- Raises:
FileNotFoundError – If the checkpoint file is not found.
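The interplay of patience, delta, counter, and early_stop can be sketched without any deep learning framework. The class below is a hypothetical, simplified stand-in (PatienceTracker is not part of the package) that omits checkpoint saving and assumes a lower validation loss is better:

```python
class PatienceTracker:
    """Hypothetical, framework-free version of the patience/delta bookkeeping."""

    def __init__(self, patience=100, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.counter = 0
        self.early_stop = False
        self.val_loss_min = float("inf")

    def step(self, val_loss):
        # An epoch counts as an improvement only if the loss drops by more than delta.
        if val_loss < self.val_loss_min - self.delta:
            self.val_loss_min = val_loss
            self.counter = 0  # a real callback would also save a checkpoint here
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        return self.early_stop


tracker = PatienceTracker(patience=3, delta=0.01)
losses = [1.0, 0.8, 0.795, 0.80, 0.799, 0.81]
stopped_at = None
for epoch, loss in enumerate(losses):
    if tracker.step(loss):
        stopped_at = epoch
        break
print(stopped_at, tracker.val_loss_min)
```

With delta=0.01, the drop from 0.8 to 0.795 is too small to reset the counter, so three non-improving epochs accumulate and training stops.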
- geogenie.utils.callbacks.callback_init(optimizer, args, trial=None, boot=None)[source]
Initialize early stopping and learning rate scheduler callbacks.
- EarlyStopping Arguments:
output_dir (str) – Directory to save the outputs.
prefix (str) – Prefix for the saved files.
patience (int) – Number of epochs to wait for improvement before stopping.
verbose (bool) – If True, prints messages about early stopping.
delta (float) – Minimum change to qualify as an improvement.
trial (optuna.trial.Trial) – Optuna trial object.
boot (any) – Boot object or identifier.
- ReduceLROnPlateau Arguments:
optimizer (torch.optim.Optimizer) – Wrapped optimizer.
mode (str) – One of ‘min’, ‘max’. In ‘min’ mode, lr will be reduced when the quantity monitored has stopped decreasing; in ‘max’ mode it will be reduced when the quantity monitored has stopped increasing.
factor (float) – Factor by which the learning rate will be reduced. new_lr = lr * factor.
patience (int) – Number of epochs with no improvement after which learning rate will be reduced.
verbose (bool) – If True, prints a message to stdout for each update.
- Parameters:
optimizer (torch.optim.Optimizer) – The optimizer for which the learning rate scheduler will be applied.
args (argparse.Namespace) – Argument namespace containing the necessary hyperparameters and settings.
trial (optuna.trial.Trial, optional) – Optuna trial object for hyperparameter optimization. Defaults to None.
boot (any, optional) – Boot object or identifier, used for early stopping callback. Defaults to None.
- Returns:
A tuple containing the initialized EarlyStopping and ReduceLROnPlateau scheduler.
- Return type:
tuple
geogenie.utils.data module
- class geogenie.utils.data.CustomDataset(features, labels=None, sample_weights=None, sample_ids=None, dtype=torch.float32)[source]
Bases:
Dataset
Class to create a custom PyTorch Dataset with sample weighting and sample IDs.
This class defines a custom PyTorch Dataset that incorporates sample weighting and sample IDs.
- tensors
Tuple consisting of (features, labels, sample_weights, sample_ids).
- Type:
tuple
- property features
Get the features tensor.
- property labels
Get the labels tensor.
- property n_features
Return the number of columns in the features dataset.
- property n_labels
Return the number of columns in the labels dataset.
- property sample_ids
Get the sample IDs.
- property sample_weights
Get the sample weights.
geogenie.utils.data_structure module
- class geogenie.utils.data_structure.DataStructure(vcf_file, verbose=False, dtype=torch.float32, debug=False)[source]
Bases:
object
Class to hold data structure from input VCF file.
High level class overview. Key class functionalities include data loading, preprocessing, transformation, and machine learning model preparation.
Initialization and Data Parsing - It loads VCF (Variant Call Format) files using pysam, processes genotypes, and handles missing data.
Data Transformation and Imputation - Implements methods for allele counting, imputing missing genotypes, normalizing data, and transforming genotypes to various encodings.
Data Splitting - Facilitates splitting data into training, validation, and test sets.
Outlier Detection - Includes methods for detecting and handling outliers in the data.
Data Preprocessing and Embedding - The class contains methods for preprocessing data, including scaling, dimensionality reduction, and embedding using various techniques like PCA, t-SNE, MCA, etc.
Data Analysis and Visualization - The script integrates with the geogenie library for tasks like outlier detection and plotting.
Machine Learning and Data Loading - It includes functionalities for creating data loaders (using torch) and preparing datasets for machine learning tasks, with support for different data sampling strategies and weightings.
Utility Methods - The script provides additional utility methods for tasks like reading GTseq data, setting parameters, and selecting optimal components for different data transformations.
In summary, the script is designed for comprehensive genomic data analysis, offering capabilities for data loading, preprocessing, transformation, machine learning model preparation, and visualization. It is structured to handle data from VCF files, process it through various analytical and transformational steps, and prepare it for further analysis or machine learning tasks.
- call_create_dataloaders(X, y, args, is_val, sample_ids, dataset=None)[source]
Helper method to create DataLoader objects for different datasets.
- Parameters:
X (numpy.ndarray | list) – Feature data.
y (numpy.ndarray | None) – Target data.
args (argparse.Namespace) – User-supplied arguments.
is_val (bool) – Whether the dataset is validation or test data. Default is False.
sample_ids (List[str]) – List of sample IDs. Default is None.
dataset (str) – Dataset type. Required keyword argument. Default is None.
- Returns:
DataLoader object.
- Return type:
torch.utils.data.DataLoader
- Raises:
ValueError – If sample IDs are not provided.
- count_alleles()[source]
Count alleles for each SNP across all samples.
- Returns:
2D array of allele counts with shape (n_loci, 2).
- Return type:
numpy.ndarray
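To make the (n_loci, 2) output shape concrete, here is a small sketch that counts reference and alternate alleles per locus. The input layout (a list of samples, each holding one allele pair per locus, with None for missing calls) and the function name count_alleles_sketch are assumptions for illustration only:

```python
def count_alleles_sketch(genotypes):
    """Count reference (0) and alternate (1) alleles per locus.

    `genotypes` is a list of samples, each a list of (allele, allele)
    tuples per locus; None marks a missing call. This layout is an
    assumption made for the example.
    """
    n_loci = len(genotypes[0])
    counts = [[0, 0] for _ in range(n_loci)]
    for sample in genotypes:
        for locus, call in enumerate(sample):
            for allele in call:
                if allele is not None:  # skip missing calls
                    counts[locus][allele] += 1
    return counts


genotypes = [
    [(0, 0), (0, 1), (None, None)],
    [(0, 1), (1, 1), (0, 0)],
    [(0, 0), (0, 1), (0, 1)],
]
counts = count_alleles_sketch(genotypes)
print(counts)
```

Each row of the result is one locus, and its two columns are the reference and alternate counts.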
- create_dataloaders(X, y, batch_size, args, is_val=False, sample_ids=None, dataset=None)[source]
Create dataloaders for training, testing, and validation datasets.
- Parameters:
X (numpy.ndarray | list) – X dataset. Train, test, or validation.
y (numpy.ndarray | None) – Target data (train, test, or validation). None for GNN.
batch_size (int) – Batch size to use with model.
args (argparse.Namespace) – User-supplied arguments.
is_val (bool) – Whether using validation/ test dataset. Otherwise should be training dataset. Default is False.
sample_ids (List[str]) – List of sample IDs. Default is None.
dataset (str) – Dataset type. Required keyword argument. Default is None.
- Returns:
DataLoader object suitable for the specified model type.
- Return type:
torch.utils.data.DataLoader
- define_params(args)[source]
Defines and updates class attributes from an argparse.Namespace object.
This method sets the class attributes based on the parameters provided in the argparse.Namespace object. It is used to update the class parameters with the values provided in the command-line arguments.
- Parameters:
args (argparse.Namespace) – Argument namespace containing the parameters.
- embed(args, X=None, alg='pca', full_dataset_only=False, transform_only=False)[source]
Embed SNP data using one of several dimensionality reduction techniques.
- Parameters:
args (argparse.Namespace) – User-supplied arguments.
X (numpy.ndarray) – Data to embed. If None, uses self.genotypes_enc. Default is None.
alg (str) – Algorithm to use. Default is ‘pca’.
full_dataset_only (bool) – If True, only embed and return full dataset. Default is False.
transform_only (bool) – If True, only transform without fitting. Default is False.
- Returns:
Embedded data.
- Return type:
numpy.ndarray
- Raises:
EmbeddingError – If the optimal number of components cannot be estimated.
- extract_datasets(outliers, args)[source]
Extracts and separates datasets into known and predicted sets based on the presence of missing data.
- Parameters:
outliers (numpy.ndarray) – Array of outlier indices.
args (argparse.Namespace) – User-supplied arguments.
- Returns:
Extracted datasets and sample indices.
- Return type:
tuple
- filter_gt(gt, min_mac, max_snps, allele_counts)[source]
Filter genotypes based on minor allele count and random subsets (max_snps).
- Parameters:
gt (numpy.ndarray) – Genotypes to filter.
min_mac (int) – Minimum minor allele count.
max_snps (int, optional) – Maximum number of SNPs to retain.
allele_counts (numpy.ndarray) – Allele counts.
- Returns:
Filtered genotypes.
- Return type:
numpy.ndarray
- find_optimal_nmf_components(data, min_components, max_components)[source]
Find the optimal number of components for NMF based on reconstruction error.
- Parameters:
data (np.array) – The data to fit the NMF model.
min_components (int) – The minimum number of components to try.
max_components (int) – The maximum number of components to try.
- Returns:
The optimal number of components and the reconstruction errors.
- Return type:
tuple
- generate_unknowns(p=0.1, seed=None, verbose=False)[source]
Randomly choose unknown samples for prediction.
Only used if the user does not supply any unknowns.
- Parameters:
p (float) – Proportion of samples to randomly select for the unknown prediction dataset. Defaults to 0.1.
seed (int or None) – Random seed to use for the random choice generator. Defaults to None (no random seed supplied).
verbose (bool) – Whether in verbose mode. Defaults to False.
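A seeded random draw of a proportion p of samples could look like the following sketch. The helper name choose_unknowns, the rounding rule, and the minimum of one sample are assumptions, not the method's documented behavior:

```python
import random


def choose_unknowns(sample_ids, p=0.1, seed=None):
    """Hypothetical helper: pick a proportion p of samples to treat as unknown."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    n_unknown = max(1, round(len(sample_ids) * p))
    return sorted(rng.sample(sample_ids, n_unknown))


sample_ids = [f"sample{i:02d}" for i in range(50)]
unknowns = choose_unknowns(sample_ids, p=0.1, seed=42)
print(len(unknowns))
```

Passing the same seed reproduces the same selection, which is what makes the held-out set comparable across runs.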
- get_num_pca_comp(x)[source]
Get optimal number of PCA components.
- Parameters:
x (numpy.ndarray) – Dataset to fit PCA to.
- Returns:
Optimal number of principal components to use.
- Return type:
int
- get_sample_weights(y, args)[source]
Gets inverse sample_weights based on sampling density.
Uses scikit-learn KernelDensity with a grid search to estimate the optimal bandwidth for GeographicDensitySampler.
- Parameters:
y (numpy.ndarray) – Target values.
args (argparse.Namespace) – User-supplied arguments.
- Returns:
The weighted sampler with estimated sample weights.
- Return type:
GeographicDensitySampler
- impute_missing(X, transform_only=False)[source]
Impute missing genotypes based on allele frequency threshold.
- Parameters:
X (numpy.ndarray) – Data to impute.
transform_only (bool) – Whether to transform, but not fit. Default is False.
- Returns:
Imputed data.
- Return type:
numpy.ndarray
- is_biallelic(record)[source]
Check whether a VCF record is biallelic (has exactly two alleles).
- Parameters:
record (pysam.VariantRecord) – A VCF record.
- Returns:
True if the record is biallelic, False otherwise.
- Return type:
bool
- load_and_preprocess_data(args)[source]
Wrapper method to load and preprocess data.
Code execution order is listed below.
Sample Data Loading and Sorting - Calls self.sort_samples with args.sample_data to load and sort sample data. This step involves reading sample data, presumably including their geographical locations, and aligning them with the genomic data.
SNP Encoding Transformation - Transforms Single Nucleotide Polymorphisms (SNPs) into a 0,1,2 encoding format using self.snps_to_012, considering parameters like min_mac (minimum minor allele count) and max_snps (maximum number of SNPs).
Validation of Feature and Target Lengths - Ensures that the feature data (X) and target data (y) have the same number of rows using self.validate_feature_target_len.
Missing Data Imputation on full dataset - Imputes missing data in self.genotypes_enc using self.impute_missing.
Data Embedding and Transformation - Performs an embedding transformation (like PCA) on the imputed data (X) using self.embed, with full_dataset_only=True and transform_only=False.
Index Mask Setup - Defines masks for prediction (self.pred_mask) to identify samples with known and unknown locations. Sets up index masks for data using self.setup_index_masks.
Outlier Detection (Conditional) - If args.detect_outliers is True, performs outlier detection using self.run_outlier_detection.
Extract datasets - self.X, self.y, self.X_pred, self.true_idx, self.all_samples, self.samples, self.pred_samples, self.outlier_samples, self.non_outlier_samples = self.extract_datasets(all_outliers, args)
Plotting Outliers (Conditional) - If outliers are detected, plots the outliers using self.plotting.plot_outliers.
Data Normalization (Placeholder) - Normalizes the target data (y) using self.normalize_target with placeholder=True. This does nothing, unless placeholder=False.
Splitting into Train, Test, and Validation Sets - Splits the dataset into training, validation, and testing sets using self.split_train_test.
Feature Embedding for Train, Validation, and Test Sets (Conditional) - If args.embedding_type is not None, embeds the features of the training, validation, and test sets using self.embed.
Logging Data Split and DataLoader Creation - Logs the completion of data splitting and the start of DataLoader creation, if verbosity is enabled.
DataLoader Creation - Creates DataLoaders for training, validation, and test datasets using self.call_create_dataloaders. An additional DataLoader is created for gradient boosting if args.use_gradient_boosting is True.
Logging Completion of Preprocessing - Logs the successful completion of data loading and preprocessing, if verbosity is enabled.
- map_alleles_to_iupac(alleles, iupac_dict)[source]
Maps a list of allele tuples to their corresponding IUPAC nucleotide codes.
- Parameters:
alleles (List[tuple]) – List of tuples representing alleles.
iupac_dict (dict) – Dictionary mapping allele tuples to IUPAC codes.
- Returns:
List of IUPAC nucleotide codes corresponding to the alleles.
- Return type:
List[str]
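The mapping amounts to a dictionary lookup per allele pair. The sketch below uses a partial lookup table keyed by alphabetically sorted pairs; the key ordering, the handling of missing calls as "N", and the function name alleles_to_iupac are all assumptions for illustration:

```python
# Partial, illustrative lookup table: sorted allele pairs -> IUPAC codes.
iupac_dict = {
    ("A", "A"): "A", ("C", "C"): "C", ("G", "G"): "G", ("T", "T"): "T",
    ("A", "G"): "R", ("C", "T"): "Y", ("C", "G"): "S",
    ("A", "T"): "W", ("G", "T"): "K", ("A", "C"): "M",
}


def alleles_to_iupac(alleles, iupac_dict, missing="N"):
    """Hypothetical helper: translate allele tuples into IUPAC codes."""
    codes = []
    for pair in alleles:
        if None in pair:  # treat any missing allele as a missing call
            codes.append(missing)
            continue
        key = tuple(sorted(pair))  # make ("T", "C") and ("C", "T") equivalent
        codes.append(iupac_dict.get(key, missing))
    return codes


codes = alleles_to_iupac([("A", "G"), ("T", "C"), ("G", "G"), (None, None)], iupac_dict)
print(codes)
```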
- map_outliers_through_filters(original_indices, filter_stages, outliers)[source]
Maps outlier indices through multiple filtering stages back to the original dataset.
- Parameters:
original_indices (np.ndarray) – Array of original indices before any filtering.
filter_stages (List[np.ndarray]) – List of arrays of indices after each filtering stage.
outliers (np.ndarray) – Outlier indices in the most filtered dataset.
- Returns:
Mapped outlier indices in the original dataset.
- Return type:
np.ndarray
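One way to picture the back-mapping: if each filter stage stores, for every retained row, its position in the previous stage, then an outlier index can be walked backwards through the stages. That interpretation of filter_stages is an assumption, as is the helper name map_back; plain lists stand in for arrays here:

```python
def map_back(original_indices, filter_stages, outliers):
    """Hypothetical helper: trace filtered positions back to original indices."""
    mapped = list(outliers)
    # Undo the most recent filter first, then each earlier one.
    for stage in reversed(filter_stages):
        mapped = [stage[i] for i in mapped]
    return [original_indices[i] for i in mapped]


original_indices = [10, 11, 12, 13, 14, 15]
filter_stages = [
    [0, 2, 3, 5],  # stage 1 keeps rows 0, 2, 3, 5 of the original data
    [1, 3],        # stage 2 keeps rows 1 and 3 of stage 1
]
mapped = map_back(original_indices, filter_stages, outliers=[0, 1])
print(mapped)
```

Row 1 of the final dataset is row 3 of stage 1, which is row 5 of the original data, so it maps back to original index 15.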
- normalize_target(y, transform_only=False)[source]
Normalize locations, ignoring NaN.
- Parameters:
y (numpy.ndarray) – Array of locations to normalize.
transform_only (bool) – Whether to transform without fitting. Default is False.
- Returns:
Normalized locations.
- Return type:
numpy.ndarray
- property params
Getter for the params dictionary.
- Returns:
Parameters dictionary.
- Return type:
dict
- perform_mca_and_select_components(data, n_components_range, S)[source]
Perform MCA on the provided data and select the optimal number of components.
- Parameters:
data (pd.DataFrame) – The categorical data.
n_components_range (list) – The range of components to explore.
S (float) – Sensitivity setting for selecting optimal number of components.
- Returns:
The optimal number of components.
- Return type:
int
- run_outlier_detection(args, X, indices, y, index, filter_stage_indices)[source]
Performs outlier detection using geographic and genetic criteria.
- Parameters:
args (argparse.Namespace) – User-supplied arguments.
X (numpy.ndarray) – Feature matrix.
indices (numpy.ndarray) – Indices of samples.
y (numpy.ndarray) – Target variable (coordinates).
index (numpy.ndarray) – Index array for samples.
filter_stage_indices (List[np.ndarray]) – List of filter stage indices.
- Returns:
Array of outlier indices.
- Return type:
numpy.ndarray
- Raises:
OutlierDetectionError – If an error occurs during outlier detection.
- select_optimal_components(cumulative_inertia, S)[source]
Select the optimal number of components based on explained inertia.
- Parameters:
cumulative_inertia (list) – The cumulative inertia for each component.
S (float) – Sensitivity setting for selecting optimal number of components.
- Returns:
The optimal number of components.
- Return type:
int
- setup_index_masks(X)[source]
Sets up index masks for filtering the data.
- Parameters:
X (numpy.ndarray) – Feature data.
- Returns:
Filtered feature data, indices, target data, and sample indices.
- Return type:
tuple
- snps_to_012(min_mac=2, max_snps=None, return_values=True)[source]
Convert IUPAC SNPs to 012 encodings.
- Parameters:
min_mac (int) – Minimum minor allele count. Default is 2.
max_snps (int, optional) – Maximum number of SNPs to retain.
return_values (bool) – Whether to return encoded values. Default is True.
- Returns:
Encoded genotypes if return_values is True, otherwise updates internal state.
- Return type:
numpy.ndarray
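For a single biallelic locus, 012 encoding can be sketched as follows, assuming 0 is homozygous reference, 1 is heterozygous, 2 is homozygous alternate, and any other code (such as N) is missing. The missing value -9 and the helper name encode_012 are hypothetical choices for this example:

```python
# IUPAC ambiguity codes for heterozygous calls, keyed by the two bases involved.
IUPAC_HET = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def encode_012(code, ref, alt, missing=-9):
    """Hypothetical helper: encode one IUPAC genotype code as 0, 1, or 2."""
    if code == ref:
        return 0  # homozygous reference
    if code == alt:
        return 2  # homozygous alternate
    if code == IUPAC_HET.get(frozenset((ref, alt))):
        return 1  # heterozygous for exactly this ref/alt pair
    return missing


encoded = [encode_012(code, ref="A", alt="G") for code in ["A", "R", "G", "N"]]
print(encoded)
```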
- sort_samples(sample_data_filename)[source]
Load sample_data and popmap and sort to match VCF file.
- Parameters:
sample_data_filename (str) – Filename of the sample data file.
- Raises:
InvalidSampleDataError – If the sample data file format is incorrect.
- split_train_test(train_split, val_split, seed, args)[source]
Splits the data into training, validation, and test datasets.
- Parameters:
train_split (float) – Proportion of the data to use for training.
val_split (float) – Proportion of the data to use for validation.
seed (int) – Random seed for reproducibility.
args (argparse.Namespace) – Argument namespace containing additional parameters.
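A seeded three-way split can be sketched with the standard library. This example assumes train_split and val_split are both fractions of the full dataset and that the remainder becomes the test set; the actual method may define the proportions differently, and split_indices is a hypothetical name:

```python
import random


def split_indices(n_samples, train_split, val_split, seed):
    """Hypothetical helper: shuffle indices and cut them into three parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    n_train = round(n_samples * train_split)
    n_val = round(n_samples * val_split)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over
    return train, val, test


train, val, test = split_indices(100, train_split=0.8, val_split=0.1, seed=42)
print(len(train), len(val), len(test))
```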
- validate_feature_target_len()[source]
Validate that the feature and target datasets have the same length.
- Raises:
InvalidInputShapeError – If the shapes of the feature and target datasets do not match.
geogenie.utils.exceptions module
- exception geogenie.utils.exceptions.DataStructureError[source]
Bases:
Exception
Base class for exceptions in DataStructure.
- exception geogenie.utils.exceptions.EmbeddingError(message='n_components could not be estimated for embedding.')[source]
Bases:
DataStructureError
Exception raised for errors during embedding.
Bases:
Exception
Exception raised when a GPU is specified but not available.
- exception geogenie.utils.exceptions.InvalidInputShapeError(shape1, shape2)[source]
Bases:
DataStructureError
Exception raised for invalid input shapes.
- exception geogenie.utils.exceptions.InvalidSampleDataError(message='Invalid sample data format. Expected a tab-delimited file with three columns: sampleID, x, and y.')[source]
Bases:
DataStructureError
Exception raised for errors in the sample data file.
- exception geogenie.utils.exceptions.OutlierDetectionError(message='Error occurred during outlier detection.')[source]
Bases:
DataStructureError
Exception raised for errors during outlier detection.
- exception geogenie.utils.exceptions.ResourceAllocationError(message='Specified resource is not available.')[source]
Bases:
Exception
Exception raised when a specified resource is invalid.
- exception geogenie.utils.exceptions.SampleOrderingError(message='Invalid sample ordering after filtering and sorting.')[source]
Bases:
DataStructureError
Exception raised for errors in the sample ordering.
geogenie.utils.logger module
- geogenie.utils.logger.setup_logger(log_file, log_level=20)[source]
Function to set up a logger for logging info, warnings, errors, and debug prints.
- Parameters:
log_file (str) – Filename for log file.
log_level (int) – Log level to use. Should be one of logging.DEBUG, logging.INFO, logging.WARNING, or logging.ERROR; these constants are integers. Defaults to logging.INFO (20).
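For orientation, a file logger with a configurable level can be set up with the standard logging module roughly as below. The function name make_logger, the logger name, and the format string are assumptions; this is not the package's own setup_logger:

```python
import logging
import os
import tempfile


def make_logger(log_file, log_level=logging.INFO):
    """Hypothetical helper: file logger that records messages at log_level and above."""
    logger = logging.getLogger("geogenie_example")  # made-up logger name
    logger.setLevel(log_level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.FileHandler(log_file)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "example.log")
logger = make_logger(log_path)
logger.debug("hidden at INFO level")
logger.info("data loaded")
for handler in logger.handlers:
    handler.flush()
with open(log_path) as fh:
    contents = fh.read()
print(contents)
```

Because the level is logging.INFO, the debug message is filtered out and only the info message reaches the file.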
geogenie.utils.loss module
- class geogenie.utils.loss.WeightedDRMSLoss(radius=6371)[source]
Bases:
Module
Custom loss class to compute the Distance Root Mean Square (DRMS) for longitude and latitude coordinates.
- radius
Radius of the Earth in kilometers. Default is 6371 km.
- Type:
float
- forward(preds, targets, sample_weight=None)[source]
Forward pass to compute the Distance Root Mean Square (DRMS) loss.
- Parameters:
preds (torch.Tensor) – Predicted longitude and latitude coordinates.
targets (torch.Tensor) – Actual longitude and latitude coordinates.
sample_weight (torch.Tensor) – Sample weights to make some samples more or less important than others. Defaults to None.
- Returns:
DRMS loss.
- Return type:
torch.Tensor
- class geogenie.utils.loss.WeightedHuberLoss(delta=1.0, smoothing_factor=0.1)[source]
Bases:
Module
Custom loss class to compute the Weighted Huber Loss.
- delta
The threshold for the Huber loss. Default is 1.0.
- Type:
float
- smoothing_factor
The smoothing factor for the target. Default is 0.1.
- Type:
float
- geogenie.utils.loss.weighted_rmse_loss(y_true, y_pred, sample_weight=None)[source]
Custom PyTorch weighted RMSE loss function.
This method computes the weighted RMSE loss between the ground truth and predictions.
- Parameters:
y_true (torch.Tensor) – Ground truth values.
y_pred (torch.Tensor) – Predictions.
sample_weight (torch.Tensor) – Sample weights (1-dimensional).
- Returns:
Weighted RMSE loss.
- Return type:
float
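A weighted RMSE is commonly computed as the square root of the weight-normalized sum of per-sample squared errors. The pure-Python sketch below follows that convention; averaging over coordinates within a sample and dividing by the sum of the weights are assumptions, and the package's PyTorch implementation may normalize differently:

```python
import math


def weighted_rmse(y_true, y_pred, sample_weight=None):
    """Hypothetical helper: weighted RMSE over rows of coordinates."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)  # unweighted case
    total = 0.0
    for true_row, pred_row, weight in zip(y_true, y_pred, sample_weight):
        # Mean squared error across the coordinates of one sample.
        sq_err = sum((t - p) ** 2 for t, p in zip(true_row, pred_row)) / len(true_row)
        total += weight * sq_err
    return math.sqrt(total / sum(sample_weight))


y_true = [[0.0, 0.0], [0.0, 0.0]]
y_pred = [[3.0, 3.0], [0.0, 0.0]]
unweighted = weighted_rmse(y_true, y_pred)
first_only = weighted_rmse(y_true, y_pred, sample_weight=[1.0, 0.0])
print(unweighted, first_only)
```

Setting the second weight to zero removes the perfectly predicted sample from the average, so the loss rises from sqrt(4.5) to 3.0.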
geogenie.utils.scorers module
- class geogenie.utils.scorers.LocallyLinearEmbeddingWrapper(*, n_neighbors=5, n_components=2, reg=0.001, eigen_solver='auto', tol=1e-06, max_iter=100, method='standard', hessian_tol=0.0001, modified_tol=1e-12, neighbors_algorithm='auto', random_state=None, n_jobs=None)[source]
Bases:
LocallyLinearEmbedding
Wrapper class for LocallyLinearEmbedding to allow for prediction.
- _sklearn_auto_wrap_output_keys = {'transform'}
- static lle_reconstruction_scorer(estimator, X, y=None)[source]
Compute the negative reconstruction error for an LLE model to use as a scorer. GridSearchCV assumes that higher score values are better, so the reconstruction error is negated.
- Parameters:
estimator (LocallyLinearEmbedding) – Fitted LLE model.
X (numpy.ndarray) – Original high-dimensional data.
- Returns:
Negative reconstruction error.
- Return type:
float
- geogenie.utils.scorers.calculate_r2_knn(predicted_data, actual_data)[source]
Calculate the coefficient of determination (R^2) for predictions.
- Parameters:
predicted_data (np.array) – Predicted data from KNN.
actual_data (np.array) – Actual data.
- Returns:
R^2 value.
- Return type:
float
- geogenie.utils.scorers.haversine_distance(coord1, coord2)[source]
Calculate the Haversine distance between two geographic coordinate points.
- Parameters:
coord1 (tuple) – Latitude and longitude of the first point.
coord2 (tuple) – Latitude and longitude of the second point.
- Returns:
Distance in kilometers.
- Return type:
float
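The Haversine formula itself is standard, so a standalone sketch is straightforward. It takes (latitude, longitude) tuples in degrees, as described above, and assumes an Earth radius of 6371 km (the default used by WeightedDRMSLoss); the function name haversine_km is hypothetical:

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(coord1, coord2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, coord1)
    lat2, lon2 = map(math.radians, coord2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


quarter_equator = haversine_km((0.0, 0.0), (0.0, 90.0))
print(round(quarter_equator, 1))
```

Two points on the equator 90 degrees of longitude apart are a quarter of the circumference from each other, which gives a quick sanity check of the result.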
geogenie.utils.spatial_data_processors module
- class geogenie.utils.spatial_data_processors.SpatialDataProcessor(output_dir=None, basemap_fips=None, crs='EPSG:4326', logger=None)[source]
Bases:
object
Spatial data processing class.
This class provides a set of tools for processing spatial data, including calculating statistics, distances, and clustering.
- tmpdir
Temporary directory for shapefiles.
- Type:
Path
- output_dir
Output directory for shapefiles.
- Type:
Path
- basemap_fips
FIPS code for the base map.
- Type:
str
- crs
Coordinate reference system.
- Type:
str
- logger
Logger object.
- Type:
logging.Logger
- _ensure_is_gdf(gdf)[source]
Ensure the data is a GeoPandas GeoDataFrame.
- Parameters:
gdf (geopandas.GeoDataFrame) – GeoDataFrame to validate.
- Returns:
Validated GeoDataFrame.
- Return type:
geopandas.GeoDataFrame
- Raises:
TypeError – If the data is not a valid format.
- _ensure_is_numpy(X)[source]
Ensure the data is a NumPy array.
- Parameters:
X (np.ndarray) – Data to validate.
- Returns:
Validated NumPy array.
- Return type:
np.ndarray
- Raises:
TypeError – If the data is not a valid format.
- _ensure_is_pandas(df)[source]
Ensure the data is a Pandas DataFrame.
- Parameters:
df (pandas.DataFrame) – Data to validate.
- Returns:
Validated DataFrame.
- Return type:
pandas.DataFrame
- Raises:
TypeError – If the data is not a valid format.
- calculate_statistics(gdf, max_boots=None, seed=None, known_coords=None)[source]
Calculate statistics on properly projected data.
- extract_basemap_path_url(url)[source]
Extract base map from provided URL or file path.
- Parameters:
url (str) – URL or file path to extract base map from.
- Returns:
Extracted base map path or URL.
- Return type:
str
- Raises:
ValueError – If the base map FIPS code is not provided.
- geodesic_distance(coords1, coords2)[source]
Calculate geodesic distance between two sets of coordinates.
- haversine_distance(coords1, coords2)[source]
Calculate haversine distance between two sets of points.
- Parameters:
coords1 (np.ndarray) – First set of coordinates.
coords2 (np.ndarray) – Second set of coordinates.
- Returns:
Haversine distance between coords1 and coords2.
- Return type:
np.ndarray
- haversine_error(point1, point2)[source]
Calculate the haversine error between two geographic points.
- Parameters:
point1 (float) – First point to calculate error from.
point2 (float) – Second point to calculate error from.
- Returns:
Haversine error between points 1 and 2.
- Return type:
np.ndarray
- to_geopandas(df)[source]
Convert DataFrame to GeoPandas GeoDataFrame with proper CRS.
- Parameters:
df (pandas.DataFrame) – DataFrame to convert to geopandas.GeoDataFrame.
- Returns:
Converted GeoDataFrame object.
- Return type:
geopandas.GeoDataFrame
geogenie.utils.transformers module
- class geogenie.utils.transformers.MCA(n_components=2, n_iter=10, check_input=True, random_state=None, one_hot=True, categories=[0, 1, 2], epsilon=1e-05)[source]
Bases:
BaseEstimator, TransformerMixin
Class to perform Multiple Correspondence Analysis (MCA).
This class performs Multiple Correspondence Analysis (MCA) on the input data.
- n_components
Number of MCA components to output.
- Type:
int
- n_iter
Number of randomized SVD iterations to perform.
- Type:
int
- check_input
Whether to check input data for conformity.
- Type:
bool
- random_state
Random state for reproducibility.
- Type:
int or None
- one_hot
Flag for one-hot encoding the input data.
- Type:
bool
- categories
Possible categories in input features.
- Type:
list
- epsilon
Small value to prevent division by 0.
- Type:
float
- logger
Logger object for the class
- Type:
logging.Logger
- _compute_S_matrix(X)[source]
Compute the S matrix.
- Parameters:
X (np.ndarray) – The input data.
- Returns:
The S matrix
- Return type:
sp.spmatrix
- _normalize_data(X)[source]
Normalize the input data.
- Parameters:
X (np.ndarray) – Array to normalize.
- Returns:
Normalized array.
- Return type:
np.ndarray
- _sklearn_auto_wrap_output_keys = {'transform'}
- class geogenie.utils.transformers.MinMaxScalerGeo(lat_range=(-90, 90), lon_range=(-180, 180), scale_min=0, scale_max=1)[source]
Bases:
BaseEstimator, TransformerMixin
Class to scale geographic coordinates to a specified range.
- lat_range
Minimum and maximum values for latitude.
- Type:
tuple
- lon_range
Minimum and maximum values for longitude.
- Type:
tuple
- scale_min
Minimum value of the scaled range.
- Type:
float
- scale_max
Maximum value of the scaled range.
- Type:
float
- logger
Logger object for the class.
- Type:
logging.Logger
- _sklearn_auto_wrap_output_keys = {'transform'}
- fit(X, y=None)[source]
Fit does nothing as parameters are not data-dependent.
- Parameters:
X (array-like) – The data to fit. Ignored. This parameter exists only for compatibility with the sklearn API.
y (None, optional) – Ignored. This parameter exists only for compatibility with the sklearn API.
- Returns:
Returns the instance itself.
- Return type:
self
- inverse_transform(X_scaled)[source]
Scale back the coordinates to their original range.
- Parameters:
X_scaled (array-like) – The scaled coordinates to revert.
- Returns:
Original geographic coordinates.
- Return type:
np.array
- set_inverse_transform_request(*, X_scaled: bool | None | str = '$UNCHANGED$') MinMaxScalerGeo
Request metadata passed to the inverse_transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
X_scaled (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_scaled parameter in inverse_transform.
- Returns:
self – The updated object.
- Return type:
object
- transform(X)[source]
Scale the geographic coordinates based on the provided ranges.
- Parameters:
X (array-like) – The input coordinates to transform. Expected shape (n_samples, 2) where X[:, 0] should be longitude and X[:, 1] should be latitude.
- Returns:
Transformed coordinates, where each feature is scaled to [scale_min, scale_max].
- Return type:
np.array
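The scaling is ordinary min-max normalization against fixed geographic bounds rather than bounds learned from the data. A list-based sketch under that assumption, with longitude in the first column and latitude in the second as stated above (the function names scale_coords and unscale_coords are hypothetical):

```python
def scale_coords(coords, lat_range=(-90, 90), lon_range=(-180, 180), scale_min=0, scale_max=1):
    """Hypothetical helper: map (lon, lat) pairs into [scale_min, scale_max]."""
    span = scale_max - scale_min
    scaled = []
    for lon, lat in coords:
        lon_s = (lon - lon_range[0]) / (lon_range[1] - lon_range[0]) * span + scale_min
        lat_s = (lat - lat_range[0]) / (lat_range[1] - lat_range[0]) * span + scale_min
        scaled.append((lon_s, lat_s))
    return scaled


def unscale_coords(scaled, lat_range=(-90, 90), lon_range=(-180, 180), scale_min=0, scale_max=1):
    """Hypothetical helper: invert scale_coords."""
    span = scale_max - scale_min
    coords = []
    for lon_s, lat_s in scaled:
        lon = (lon_s - scale_min) / span * (lon_range[1] - lon_range[0]) + lon_range[0]
        lat = (lat_s - scale_min) / span * (lat_range[1] - lat_range[0]) + lat_range[0]
        coords.append((lon, lat))
    return coords


scaled = scale_coords([(-180, -90), (0, 0), (180, 90)])
restored = unscale_coords(scale_coords([(-94.2, 36.1)]))
print(scaled, restored)
```

Because the bounds are constants, fitting has nothing to learn, which is consistent with fit being a no-op.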
geogenie.utils.utils module
- geogenie.utils.utils.assign_to_bins(df, long_col, lat_col, n_bins, args, method='optics', min_samples=None, random_state=None)[source]
Assign longitude and latitude coordinates to bins based on K-Means or OPTICS clustering.
- Parameters:
df (pd.DataFrame) – DataFrame containing longitude and latitude columns.
long_col (str) – Name of the column containing longitude values.
lat_col (str) – Name of the column containing latitude values.
n_bins (int) – Number of bins (clusters) for KMeans or minimum samples for OPTICS.
method (str) – Clustering method (‘kmeans’ or ‘optics’). Defaults to ‘optics’.
min_samples (int) – Minimum number of samples for OPTICS. Defaults to None (4).
random_state (int or RandomState) – Random seed or state for reproducibility. Defaults to None (new seed each time).
- Returns:
Numpy array with bin indices indicating the bin assignment.
- Return type:
np.ndarray
- geogenie.utils.utils.check_column_dtype(df, column_name)[source]
Check if a DataFrame column is string dtype or numeric dtype.
- Parameters:
df (pd.DataFrame) – The DataFrame to check.
column_name (str) – The name of the column to check.
- Returns:
‘string’ if the column is of string dtype, ‘numeric’ if the column is of numeric dtype, otherwise ‘other’.
- Return type:
str
- geogenie.utils.utils.detect_separator(file_path)[source]
Detects the separator used in a CSV file by reading the first line.
- Parameters:
file_path (str) – Path to the CSV file.
- Returns:
The detected separator (one of ‘,’, ‘\t’, or ‘ ’).
- Return type:
str
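One simple heuristic for separator detection is to count candidate characters in the first line and pick the most frequent. The sketch below works on a string instead of a file path to stay self-contained; the candidate set (comma, tab, space), the counting rule, and the name guess_separator are assumptions, not the function's documented logic:

```python
def guess_separator(first_line, candidates=(",", "\t", " ")):
    """Hypothetical helper: pick the candidate separator that occurs most often."""
    counts = {sep: first_line.count(sep) for sep in candidates}
    best = max(candidates, key=lambda sep: counts[sep])  # ties go to the earlier candidate
    if counts[best] == 0:
        raise ValueError("no known separator found in the first line")
    return best


print(repr(guess_separator("sampleID\tx\ty")))
```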
- geogenie.utils.utils.geo_coords_is_valid(coordinates)[source]
Validates that a given NumPy array contains valid geographic coordinates.
- Parameters:
coordinates (np.ndarray) – A NumPy array of shape (n_samples, 2) where the first column is longitude and the second is latitude.
- Raises:
ValueError – If the array shape is not (n_samples, 2), or if the longitude and latitude values are not in their respective valid ranges.
- Returns:
True if the validation passes, indicating the coordinates are valid.
- Return type:
bool
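The checks described above reduce to a shape test plus two range tests. A list-based sketch, assuming the usual bounds of [-180, 180] for longitude and [-90, 90] for latitude (the helper name coords_are_valid is hypothetical):

```python
def coords_are_valid(coordinates):
    """Hypothetical helper: validate rows of (longitude, latitude) pairs."""
    for row in coordinates:
        if len(row) != 2:
            raise ValueError("coordinates must have shape (n_samples, 2)")
        lon, lat = row
        if not -180 <= lon <= 180:
            raise ValueError(f"longitude {lon} is outside [-180, 180]")
        if not -90 <= lat <= 90:
            raise ValueError(f"latitude {lat} is outside [-90, 90]")
    return True


print(coords_are_valid([(-94.2, 36.1), (151.2, -33.9)]))
```

A common failure this catches is swapped columns: (36.1, -94.2) passes the longitude test but fails the latitude test.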
- geogenie.utils.utils.read_csv_with_dynamic_sep(file_path)[source]
Reads a CSV file with a dynamically detected separator, replacing spaces with tabs in-memory.
- Parameters:
file_path (str) – Path to the CSV file.
- Returns:
The DataFrame containing the CSV data.
- Return type:
pd.DataFrame