geogenie.outliers package

Submodules

geogenie.outliers.detect_outliers module

class geogenie.outliers.detect_outliers.GeoGeneticOutlierDetector(args, genetic_data, geographic_data, output_dir, prefix, n_jobs=-1, seed=None, url='https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip', buffer=0.1, show_plots=False, debug=False, verbose=0)[source]

Bases: object

A class to detect outliers based on genomic SNPs and geographic coordinates.

This class uses a K-nearest neighbors (KNN) approach to detect outliers based on the difference between predicted and actual data.

args

Command line arguments.

Type:

argparse.Namespace

genetic_data

The SNP data as a DataFrame.

Type:

pd.DataFrame

geographic_data

Geographic coordinates as a 2D array.

Type:

np.array

output_dir

Output directory.

Type:

str

prefix

Prefix for output files.

Type:

str

n_jobs

Number of parallel jobs to run.

Type:

int

seed

Random seed to use.

Type:

int

url

URL to download the base map for plots.

Type:

str

buffer

Buffer to put around sampling area on base map when plotting.

Type:

float

show_plots

Whether to show plots in-line.

Type:

bool

debug

If True, writes genetic_data and geographic_data to file.

Type:

bool

verbose

Verbosity setting (0-2), least to most verbose.

Type:

int

logger

Logger object.

Type:

logging.Logger

plotting

Plotting object.

Type:

PlotGenIE

analysis(geo_coords, gen_coords, analysis_type, sig_level, min_nn_dist, scale_factor, max_iter, opt_ks, res, at, time_duration)[source]

Perform outlier detection analysis for genetic or geographic data.

Parameters:
  • geo_coords (np.array) – Array of geographic coordinates.

  • gen_coords (np.array) – Array of genetic data coordinates.

  • analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).

  • sig_level (float) – Significance level for detecting outliers.

  • min_nn_dist (int) – Minimum distance required to consider points.

  • scale_factor (int) – Scaling factor for geo coordinates.

  • max_iter (int) – Maximum number of iterations.

  • opt_ks (dict) – Optimal K values for genetic and geographic data.

  • res (dict) – Dictionary to store results.

  • at (str) – Analysis type (genetic, geographic).

  • time_duration (dict) – Dictionary to store run times for each method.

calculate_dgeo(pred_geo_coords, geo_coords, scalar)[source]

Calculate the Dgeo statistic for geographic coordinates.

This method calculates the geographic distance between predicted and actual geographic coordinates.

Parameters:
  • pred_geo_coords (np.array) – Predicted geographic coordinates.

  • geo_coords (np.array) – Actual geographic coordinates.

  • scalar (float) – Scalar value to divide the distance.

Returns:

Calculated Dgeo values.

Return type:

np.array
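As a rough illustration of the idea, the sketch below computes a great-circle (haversine) distance between predicted and actual coordinates and divides it by a scalar. It assumes coordinates are stored as (longitude, latitude) in decimal degrees and that distances are in kilometers; this reference states neither. The name `dgeo_sketch` and the constant `EARTH_RADIUS_KM` are hypothetical and not part of geogenie, and the package's own distance formula may differ.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def dgeo_sketch(pred_geo_coords, geo_coords, scalar):
    """Haversine distance (km) between predicted and actual points, divided by scalar.

    Both inputs are assumed to be (n, 2) arrays of (longitude, latitude) in degrees.
    """
    pred = np.radians(np.asarray(pred_geo_coords, dtype=float))
    act = np.radians(np.asarray(geo_coords, dtype=float))
    dlon = pred[:, 0] - act[:, 0]
    dlat = pred[:, 1] - act[:, 1]
    # Haversine term; clipped to [0, 1] to guard against rounding error.
    a = np.sin(dlat / 2) ** 2 + np.cos(act[:, 1]) * np.cos(pred[:, 1]) * np.sin(dlon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return dist_km / scalar
```

A sample whose predicted location coincides with its recorded location gets a value of 0, and larger values indicate a greater mismatch between the two.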

calculate_statistic(predicted_data, actual_data, is_genetic, min_nn_dist, scale_factor)[source]

Calculate the Dg or Dgeo statistic based on the difference between predicted and actual data.

This method calculates the Dg or Dgeo statistic based on the mean squared error or geographic distance between predicted and actual data.

Parameters:
  • predicted_data (np.array) – Predicted data from KNN.

  • actual_data (np.array) – Actual data.

  • is_genetic (bool) – Flag to determine if the calculation is for genetic data.

  • min_nn_dist (float) – The minimum distance to consider between geographic points.

  • scale_factor (float) – Scaling factor for geo_coords.

Returns:

Dg or Dgeo statistic for each sample.

Return type:

np.array

composite_outlier_detection(sig_level=0.05, maxk=50, min_nn_dist=1000, scale_factor=100, w_power=2)[source]

Perform composite outlier detection using the KNN approach.

Parameters:
  • sig_level (float) – Significance level for detecting outliers.

  • maxk (int) – Maximum number of nearest neighbors to consider.

  • min_nn_dist (int) – Minimum distance required to consider points.

  • scale_factor (int) – Scaling factor for geo coordinates.

  • w_power (float) – Power of distance weight in KNN prediction.

Returns:

Detected outliers for geographic and genetic data.

Return type:

dict

detect_outliers(geo_coords, gen_coords, optk, time_durations, w_power=2, sig_level=0.05, min_nn_dist=1000, scale_factor=100, analysis_type='genetic')[source]

Detect outliers based on composite data using the KNN approach.

Parameters:
  • geo_coords (np.array) – Array of geographic coordinates.

  • gen_coords (np.array) – Array of genetic data coordinates.

  • optk (int) – Optimal K for nearest neighbors (geographic).

  • time_durations (dict) – Dictionary storing run times for each method.

  • w_power (float) – Power of distance weight in KNN prediction.

  • sig_level (float) – Significance level for detecting outliers.

  • min_nn_dist (int) – Minimum distance required to consider points.

  • scale_factor (int) – Scaling factor for geo coordinates.

  • analysis_type (str) – Either ‘genetic’ or ‘geographic’.

Returns:

Indices of detected outliers, p-values, and gamma parameters for geographic and genetic outliers.

Return type:

tuple

filter_and_detect(geo_coords, gen_coords, sig_level, min_nn_dist, scale_factor, optk, outlier_flags, time_durations, analysis_type)[source]

Filter outliers and detect new outliers using KNN.

This method filters outliers and detects new outliers using KNN for genetic and geographic data.

Parameters:
  • geo_coords (np.array) – Array of geographic coordinates.

  • gen_coords (np.array) – Array of genetic data coordinates.

  • sig_level (float) – Significance level for detecting outliers.

  • min_nn_dist (int) – Minimum distance required to consider points.

  • scale_factor (int) – Scaling factor for geo coordinates.

  • optk (int) – Optimal K value for nearest neighbors.

  • outlier_flags (np.array) – Array of outlier flags.

  • time_durations (dict) – Dictionary to store run times for each method.

  • analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).

Returns:

Time durations, outlier flags, D statistics, p-values, r-squared values, gamma parameters, and filtered indices.

Return type:

tuple

find_gen_knn(coords, k, scale_factor)[source]

Find K-nearest neighbors for genetic data using PyNNDescent.

Parameters:
  • coords (np.array) – Array of genetic data coordinates.

  • k (int) – Number of neighbors.

Returns:

Indices of K-nearest neighbors.

Return type:

np.array

find_geo_knn(coords, k, min_nn_dist)[source]

Find K-nearest neighbors for geographic data considering minimum neighbor distance using PyNNDescent.

This method finds the K-nearest neighbors for geographic data using PyNNDescent.

Parameters:
  • coords (np.array) – Array of geographic coordinates.

  • k (int) – Number of neighbors.

  • min_nn_dist (float) – Minimum distance to consider for neighbors.

Returns:

Indices of K-nearest neighbors.

Return type:

np.array
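The sketch below illustrates one plausible reading of a minimum-neighbor-distance filter: candidates closer than `min_nn_dist` are skipped before the K nearest are chosen. That reading is an assumption, since this reference does not define the filter precisely. The sketch also swaps in a brute-force Euclidean search with NumPy rather than PyNNDescent, and it returns distances as well as indices for convenience. The name `geo_knn_sketch` is hypothetical.

```python
import numpy as np


def geo_knn_sketch(coords, k, min_nn_dist):
    """Brute-force KNN that ignores neighbors closer than min_nn_dist.

    Returns (distances, indices), each of shape (n, k).
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point is never its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Assumed filter: drop candidates nearer than the minimum distance.
    dist[dist < min_nn_dist] = np.inf
    idx = np.argsort(dist, axis=1)[:, :k]
    knn_dist = np.take_along_axis(dist, idx, axis=1)
    return knn_dist, idx
```

Under this reading, a filter of this kind keeps near-duplicate sampling sites from dominating each other's neighborhoods. A brute-force search is quadratic in the number of samples, which is why an approximate index is the usual choice for large datasets.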

find_optimal_k(geo_coords, gen_coords, klim, w_power, min_nn_dist, is_genetic, scale_factor)[source]

Find optimal number of nearest neighbors for KNN.

This method finds the optimal number of nearest neighbors for KNN based on the Dg or Dgeo statistic.

Parameters:
  • geo_coords (np.array) – Geographic coordinates.

  • gen_coords (np.array) – Genetic coordinates.

  • dist_matrix (np.array) – Distance matrix.

  • klim (tuple) – Range of K values to search (min_k, max_k).

  • w_power (float) – Power for distance weighting.

  • min_nn_dist (int) – Minimum nearest neighbor distance to consider points.

  • is_genetic (bool) – Flag to determine if the calculation is for genetic data as distance matrix.

  • scale_factor (float) – Factor to scale geo_coords by.

Returns:

Optimal K value.

Return type:

int

fit_gamma_mle(D_statistic, sig_level)[source]

Detect outliers using a Gamma distribution fitted to the Dg or Dgeo statistic.

Parameters:
  • D_statistic (np.array) – Dg or Dgeo statistic for each sample.

  • Dgeo (np.array) – For determining initial_shape and initial_rate.

  • sig_level (float) – Significance level for detecting outliers.

Returns:

Indices of outliers, p-values, and fitted Gamma parameters.

Return type:

tuple
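The general technique can be sketched as follows: fit a gamma distribution to the per-sample statistic, compute each sample's upper-tail p-value, and flag samples whose p-value falls below the significance level. This sketch uses `scipy.stats.gamma.fit` with the location fixed at 0 as a stand-in for the fit; the package's own optimizer, starting values, and parameterization may differ. The name `fit_gamma_sketch` is hypothetical.

```python
import numpy as np
from scipy import stats


def fit_gamma_sketch(d_statistic, sig_level=0.05):
    """Fit a gamma distribution by MLE and flag upper-tail outliers.

    Returns (outlier_indices, p_values, (shape, rate)).
    """
    d = np.asarray(d_statistic, dtype=float)
    # Fix the location at 0 so only shape and scale are estimated.
    shape, _loc, scale = stats.gamma.fit(d, floc=0)
    # Probability of observing a statistic at least this large.
    p_values = stats.gamma.sf(d, shape, loc=0, scale=scale)
    outliers = np.where(p_values < sig_level)[0]
    # SciPy uses a scale parameter; rate is its reciprocal.
    return outliers, p_values, (shape, 1.0 / scale)
```

The input must be strictly positive for the fit with a fixed location of 0 to succeed. Because only large values of the statistic signal a poor prediction, the test is one-sided.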

gamma_neg_log_likelihood(params, data)[source]

Calculate the negative log-likelihood for the gamma distribution.

Parameters:
  • params (tuple) – Contains the shape and rate parameters for the gamma distribution.

  • data (np.array) – Data to fit the gamma distribution to.

Returns:

Negative log likelihood value.

Return type:

float
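For reference, a minimal sketch of this quantity under the shape/rate parameterization is shown below. The guard that returns infinity for non-positive parameters is an assumption added so the function is safe to pass to a numerical optimizer; the name `gamma_nll_sketch` is hypothetical.

```python
import math

import numpy as np


def gamma_nll_sketch(params, data):
    """Negative log-likelihood of data under Gamma(shape, rate).

    log f(x) = shape*log(rate) - lgamma(shape) + (shape - 1)*log(x) - rate*x
    """
    shape, rate = params
    if shape <= 0 or rate <= 0:
        # Outside the parameter space; signal an infinitely bad fit.
        return float("inf")
    x = np.asarray(data, dtype=float)
    n = x.size
    log_lik = (
        n * (shape * math.log(rate) - math.lgamma(shape))
        + (shape - 1) * np.sum(np.log(x))
        - rate * np.sum(x)
    )
    return float(-log_lik)
```

With shape 1 and rate 1 the gamma distribution reduces to a unit exponential, so the result is simply the sum of the data, which makes a convenient sanity check.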

multi_stage_outlier_knn(geo_coords, gen_coords, analysis_type='composite', sig_level=0.05, maxk=50, w_power=2, min_nn_dist=1000, scale_factor=100)[source]

Perform iterative outlier detection via KNN for genetic and geographic data.

Parameters:
  • geo_coords (np.array) – Array of geographic coordinates.

  • gen_coords (np.array) – Array of genetic data coordinates.

  • analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).

  • sig_level (float) – Significance level for detecting outliers.

  • maxk (int) – Maximum number of nearest neighbors to consider.

  • w_power (float) – Power of distance weight in KNN prediction.

  • min_nn_dist (int) – Minimum distance required to consider points.

  • scale_factor (int) – Scaling factor for geo coordinates.

Returns:

Indices of detected outliers for geographic and genetic data.

Return type:

tuple

plot_gamma_dist(sig_level, d_stats, gamma_params, dtype)[source]

Plot the gamma distribution for detected outliers.

This method plots the gamma distribution for detected outliers based on the Dg or Dgeo statistic.

Parameters:
  • sig_level (float) – Significance level for detecting outliers.

  • d_stats (np.array) – Dg or Dgeo statistic for each sample.

  • gamma_params (tuple) – Parameters for the gamma distribution.

  • dtype (str) – Type of data (genetic, geographic).

predict_coords_knn(coords, knn_distances, knn_indices, w_power)[source]

Predict coordinate data using weighted KNN.

This method predicts coordinates data using weighted KNN based on the distances to the K-nearest neighbors.

Parameters:
  • coords (np.array) – Array of genetic or geographic coordinates.

  • knn_distances (np.array) – Distances to K-nearest neighbors.

  • knn_indices (np.array) – Indices of K-nearest neighbors.

  • w_power (float) – Power of distance weight in prediction.

Returns:

Predicted coordinates using weighted KNN.

Return type:

np.array
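A common form of distance-weighted KNN prediction is sketched below: each neighbor receives a weight proportional to 1 / distance ** w_power, the weights are normalized per sample, and the prediction is the weighted mean of the neighbors' coordinates. Whether geogenie uses exactly this weighting is an assumption. The `eps` floor that avoids division by zero and the name `predict_coords_knn_sketch` are hypothetical additions.

```python
import numpy as np


def predict_coords_knn_sketch(coords, knn_distances, knn_indices, w_power=2, eps=1e-12):
    """Inverse-distance-weighted mean of each sample's neighbor coordinates.

    coords: (n, d); knn_distances and knn_indices: (n, k). Returns (n, d).
    """
    coords = np.asarray(coords, dtype=float)
    dists = np.asarray(knn_distances, dtype=float)
    idx = np.asarray(knn_indices)
    # Closer neighbors get larger weights; eps guards against zero distances.
    weights = 1.0 / np.maximum(dists, eps) ** w_power
    weights /= weights.sum(axis=1, keepdims=True)
    # Gather neighbor coordinates, shape (n, k, d), then average with weights.
    neighbor_coords = coords[idx]
    return np.einsum("nk,nkd->nd", weights, neighbor_coords)
```

For example, with `w_power=2` a neighbor at distance 2 carries four times the weight of a neighbor at distance 4. Raising `w_power` concentrates the prediction on the closest neighbors, while a value of 0 gives a plain unweighted mean.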

rescale_statistic(Dgeo, s, orig_min_nn_dist, max_threshold=20)[source]

Rescale the Dgeo array to avoid large values that might cause errors in maximum likelihood estimation.

Parameters:
  • Dgeo (np.ndarray) – An array representing geographic distances or differences.

  • s (float) – A scalar value used in calculations.

  • orig_min_nn_dist (float) – Original minimum nearest neighbor distance.

  • max_threshold (int) – Maximum Dgeo value to trigger rescaling.

Returns:

Adjusted scalar value and adjusted minimum nearest neighbor distance.

Return type:

float, float

run_multistage(geo_coords, gen_coords, sig_level, min_nn_dist, scale_factor, max_iter, optk, analysis_type, time_durations)[source]

Run multi-stage outlier detection using KNN.

This method runs multi-stage outlier detection using KNN for genetic and geographic data.

Parameters:
  • geo_coords (np.array) – Array of geographic coordinates.

  • gen_coords (np.array) – Array of genetic data coordinates.

  • sig_level (float) – Significance level for detecting outliers.

  • min_nn_dist (int) – Minimum distance required to consider points.

  • scale_factor (int) – Scaling factor for geo coordinates.

  • max_iter (int) – Maximum number of iterations.

  • optk (int) – Optimal K value for nearest neighbors.

  • analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).

  • time_durations (dict) – Dictionary to store run times for each method.

Returns:

Time durations, outlier flags, D statistics, p-values, r-squared values, and gamma parameters.

Return type:

tuple

search_nn_optk(geo_coords, gen_coords, maxk, w_power, min_nn_dist, scale_factor, analysis_type)[source]

Search for optimal K for nearest neighbors.

This method searches for the optimal K for nearest neighbors based on the Dg or Dgeo statistic.

Parameters:
  • geo_coords (np.array) – Array of geographic coordinates.

  • gen_coords (np.array) – Array of genetic data coordinates.

  • maxk (int) – Maximum number of nearest neighbors to consider.

  • w_power (float) – Power of distance weight in KNN prediction.

  • min_nn_dist (int) – Minimum distance required to consider points.

  • scale_factor (int) – Scaling factor for geo coordinates.

  • analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).

Returns:

Time durations, optimal K for genetic data, and optimal K for geographic data.

Return type:

tuple

Module contents