geogenie.outliers package
Submodules
geogenie.outliers.detect_outliers module
- class geogenie.outliers.detect_outliers.GeoGeneticOutlierDetector(args, genetic_data, geographic_data, output_dir, prefix, n_jobs=-1, seed=None, url='https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_500k.zip', buffer=0.1, show_plots=False, debug=False, verbose=0)
Bases: object
A class to detect outliers based on genomic SNPs and geographic coordinates.
This class uses a K-nearest neighbors (KNN) approach to detect outliers based on the difference between predicted and actual data.
- args
Command line arguments.
- Type:
argparse.Namespace
- genetic_data
The SNP data as a DataFrame.
- Type:
pd.DataFrame
- geographic_data
Geographic coordinates as a 2D array.
- Type:
np.array
- output_dir
Output directory.
- Type:
str
- prefix
Prefix for output files.
- Type:
str
- n_jobs
Number of parallel jobs to run.
- Type:
int
- seed
Random seed to use.
- Type:
int
- url
URL from which to download the base map for plots.
- Type:
str
- buffer
Buffer to put around sampling area on base map when plotting.
- Type:
float
- show_plots
Whether to show plots in-line.
- Type:
bool
- debug
If True, writes genetic_data and geographic_data to file.
- Type:
bool
- verbose
Verbosity setting (0-2), least to most verbose.
- Type:
int
- logger
Logger object.
- Type:
logging.Logger
- analysis(geo_coords, gen_coords, analysis_type, sig_level, min_nn_dist, scale_factor, max_iter, opt_ks, res, at, time_duration)
Perform outlier detection analysis for genetic or geographic data.
- Parameters:
geo_coords (np.array) – Array of geographic coordinates.
gen_coords (np.array) – Array of genetic data coordinates.
analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).
sig_level (float) – Significance level for detecting outliers.
min_nn_dist (int) – Minimum distance required to consider points.
scale_factor (int) – Scaling factor for geo coordinates.
max_iter (int) – Maximum number of iterations.
opt_ks (dict) – Optimal K values for genetic and geographic data.
res (dict) – Dictionary to store results.
at (str) – Analysis type (genetic, geographic).
time_duration (dict) – Dictionary to store run times for each method.
- calculate_dgeo(pred_geo_coords, geo_coords, scalar)
Calculate the Dgeo statistic for geographic coordinates.
This method calculates the geographic distance between predicted and actual geographic coordinates.
- Parameters:
pred_geo_coords (np.array) – Predicted geographic coordinates.
geo_coords (np.array) – Actual geographic coordinates.
scalar (float) – Scalar value to divide the distance.
- Returns:
Calculated Dgeo values.
- Return type:
np.array
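As a rough illustration of the Dgeo calculation described above, the sketch below computes a great-circle distance between each predicted and actual point and divides it by a scalar. The haversine formula, the (longitude, latitude) ordering, the kilometre unit, and the names haversine_km and dgeo_sketch are assumptions made for this example; the source does not state which distance measure the method uses.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dgeo_sketch(pred_geo_coords, geo_coords, scalar):
    """Distance between each predicted and actual point, divided by `scalar`."""
    return [
        haversine_km(p[0], p[1], a[0], a[1]) / scalar
        for p, a in zip(pred_geo_coords, geo_coords)
    ]


# Hypothetical coordinates: the second prediction is one degree of latitude off.
pred = [(-94.0, 36.0), (-93.0, 35.0)]
actual = [(-94.0, 36.0), (-93.0, 36.0)]
dgeo = dgeo_sketch(pred, actual, scalar=100.0)
```

A perfect prediction gives a value of zero, and larger values indicate samples whose predicted location lies further from where they were actually collected.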
- calculate_statistic(predicted_data, actual_data, is_genetic, min_nn_dist, scale_factor)
Calculate the Dg or Dgeo statistic based on the difference between predicted and actual data.
This method calculates the Dg or Dgeo statistic based on the mean squared error or geographic distance between predicted and actual data.
- Parameters:
predicted_data (np.array) – Predicted data from KNN.
actual_data (np.array) – Actual data.
is_genetic (bool) – Flag to determine if the calculation is for genetic data.
min_nn_dist (float) – The minimum distance to consider between geographic points.
scale_factor (float) – Scaling factor for geo_coords.
- Returns:
Dg or Dgeo statistic for each sample.
- Return type:
np.array
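For the genetic branch, a per-sample mean squared error can be sketched as follows. This is a minimal pure-Python illustration, not the library's implementation; the function name dg_sketch and the example values are hypothetical.

```python
def dg_sketch(predicted_data, actual_data):
    """Per-sample mean squared error between predicted and actual rows."""
    stats = []
    for pred_row, actual_row in zip(predicted_data, actual_data):
        sq_err = [(p - a) ** 2 for p, a in zip(pred_row, actual_row)]
        stats.append(sum(sq_err) / len(sq_err))
    return stats


# Hypothetical data: two samples, three genetic coordinates each.
predicted = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
actual = [[0.0, 1.0, 2.0], [0.0, 2.0, 1.0]]
dg = dg_sketch(predicted, actual)
```

Each sample receives one value, so the result has the same length as the number of samples.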
- composite_outlier_detection(sig_level=0.05, maxk=50, min_nn_dist=1000, scale_factor=100, w_power=2)
Perform composite outlier detection using the KNN approach.
- Parameters:
sig_level (float) – Significance level for detecting outliers.
maxk (int) – Maximum number of nearest neighbors to consider.
min_nn_dist (int) – Minimum distance required to consider points.
scale_factor (int) – Scaling factor for geo coordinates.
w_power (float) – Power of distance weight in KNN prediction.
- Returns:
Detected outliers for geographic and genetic data.
- Return type:
dict
- detect_outliers(geo_coords, gen_coords, optk, time_durations, w_power=2, sig_level=0.05, min_nn_dist=1000, scale_factor=100, analysis_type='genetic')
Detect outliers based on composite data using the KNN approach.
- Parameters:
geo_coords (np.array) – Array of geographic coordinates.
gen_coords (np.array) – Array of genetic data coordinates.
optk (int) – Optimal K for nearest neighbors (geographic).
time_durations (dict) – Dictionary storing run times for each method.
w_power (float) – Power of distance weight in KNN prediction.
sig_level (float) – Significance level for detecting outliers.
min_nn_dist (int) – Minimum distance required to consider points.
scale_factor (int) – Scaling factor for geo coordinates.
analysis_type (str) – Either ‘genetic’ or ‘geographic’.
- Returns:
Indices of detected outliers, p-values, and gamma parameters for geographic and genetic outliers.
- Return type:
tuple
- filter_and_detect(geo_coords, gen_coords, sig_level, min_nn_dist, scale_factor, optk, outlier_flags, time_durations, analysis_type)
Filter outliers and detect new outliers using KNN.
This method filters outliers and detects new outliers using KNN for genetic and geographic data.
- Parameters:
geo_coords (np.array) – Array of geographic coordinates.
gen_coords (np.array) – Array of genetic data coordinates.
sig_level (float) – Significance level for detecting outliers.
min_nn_dist (int) – Minimum distance required to consider points.
scale_factor (int) – Scaling factor for geo coordinates.
optk (int) – Optimal K value for nearest neighbors.
outlier_flags (np.array) – Array of outlier flags.
time_durations (dict) – Dictionary to store run times for each method.
analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).
- Returns:
Time durations, outlier flags, D statistics, p-values, r-squared values, gamma parameters, and filtered indices.
- Return type:
tuple
- find_gen_knn(coords, k, scale_factor)
Find K-nearest neighbors for genetic data using PyNNDescent.
- Parameters:
dist_matrix (np.array) – Distance matrix.
k (int) – Number of neighbors.
- Returns:
Indices of K-nearest neighbors.
- Return type:
np.array
- find_geo_knn(coords, k, min_nn_dist)
Find K-nearest neighbors for geographic data using PyNNDescent, subject to a minimum neighbor distance.
- Parameters:
dist_matrix (np.array) – Distance matrix.
k (int) – Number of neighbors.
min_nn_dist (float) – Minimum distance to consider for neighbors.
- Returns:
Indices of K-nearest neighbors.
- Return type:
np.array
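The idea of a neighbor search with a minimum distance can be illustrated with a brute-force sketch. Note that this swaps PyNNDescent for an exhaustive search and uses planar Euclidean distance, purely to keep the example self-contained. It also assumes that points closer than min_nn_dist are excluded from the candidate neighbors, which the source does not spell out. The name geo_knn_sketch is hypothetical, and unlike the documented method it returns distances as well as indices.

```python
import math


def geo_knn_sketch(coords, k, min_nn_dist):
    """For each point, return indices and distances of its k nearest
    neighbours that lie at least `min_nn_dist` away (brute force)."""
    all_idx, all_dist = [], []
    for i, (xi, yi) in enumerate(coords):
        candidates = []
        for j, (xj, yj) in enumerate(coords):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)  # planar distance, for illustration
            if d >= min_nn_dist:  # assumed rule: skip points that are too close
                candidates.append((d, j))
        candidates.sort()
        nearest = candidates[:k]
        all_idx.append([j for _, j in nearest])
        all_dist.append([d for d, _ in nearest])
    return all_idx, all_dist


# Hypothetical points on a line; points 0 and 1 are closer than min_nn_dist.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
knn_idx, knn_dist = geo_knn_sketch(points, k=2, min_nn_dist=2.0)
```

Here point 0 skips its closest neighbor (point 1, at distance 1) and instead uses points 2 and 3. An approximate index such as PyNNDescent becomes worthwhile when the number of samples makes an exhaustive search too slow.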
- find_optimal_k(geo_coords, gen_coords, klim, w_power, min_nn_dist, is_genetic, scale_factor)
Find optimal number of nearest neighbors for KNN.
This method finds the optimal number of nearest neighbors for KNN based on the Dg or Dgeo statistic.
- Parameters:
geo_coords (np.array) – Geographic coordinates.
gen_coords (np.array) – Genetic coordinates.
dist_matrix (np.array) – Distance matrix.
klim (tuple) – Range of K values to search (min_k, max_k).
w_power (float) – Power for distance weighting.
min_nn_dist (int) – Minimum nearest neighbor distance to consider points.
is_genetic (bool) – Flag to determine if the calculation is for genetic data as distance matrix.
scale_factor (float) – Factor to scale geo_coords by.
- Returns:
Optimal K value.
- Return type:
int
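One way to picture the search over K is a loop that, for each candidate K, predicts every sample from its K nearest neighbors and keeps the K with the smallest total squared error. The sketch below does this for one-dimensional toy data. The selection criterion, the inverse-distance weighting, and the name choose_k_sketch are assumptions for illustration; the source only says the choice is based on the Dg or Dgeo statistic.

```python
def choose_k_sketch(x, y, klim, w_power=2.0):
    """Pick the k in [klim[0], klim[1]] whose neighbours in `x` best predict `y`."""
    best_k, best_err = None, float("inf")
    for k in range(klim[0], klim[1] + 1):
        total_err = 0.0
        for i in range(len(x)):
            # Neighbours of sample i in x-space, excluding the sample itself.
            nbrs = sorted((abs(x[i] - x[j]), j) for j in range(len(x)) if j != i)[:k]
            # Inverse-distance weights; the floor avoids division by zero.
            weights = [1.0 / max(d, 1e-12) ** w_power for d, _ in nbrs]
            pred = sum(w * y[j] for w, (_, j) in zip(weights, nbrs)) / sum(weights)
            total_err += (pred - y[i]) ** 2
        if total_err < best_err:
            best_k, best_err = k, total_err
    return best_k


# Hypothetical data: positions along a transect and a value that varies linearly.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
best_k = choose_k_sketch(x, y, klim=(1, 3))
```

With this toy data, two neighbors predict interior points exactly, so K = 2 gives the lowest total error of the three candidates.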
- fit_gamma_mle(D_statistic, sig_level)
Detect outliers using a Gamma distribution fitted to the Dg or Dgeo statistic.
- Parameters:
D_statistic (np.array) – Dg or Dgeo statistic for each sample.
Dgeo (np.array) – For determining initial_shape and initial_rate.
sig_level (float) – Significance level for detecting outliers.
- Returns:
Indices of outliers, p-values, and fitted Gamma parameters.
- Return type:
tuple
- gamma_neg_log_likelihood(params, data)
Negative log likelihood for gamma distribution.
This method calculates the negative log likelihood value for the gamma distribution.
- Parameters:
params (tuple) – Contains the shape and rate parameters for the gamma distribution.
data (np.array) – Data to fit the gamma distribution to.
- Returns:
Negative log likelihood value.
- Return type:
float
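For reference, the negative log likelihood of a gamma distribution with shape a and rate b over n observations x_i is -[n(a log b - log Gamma(a)) + (a - 1) sum(log x_i) - b sum(x_i)]. A minimal sketch of that formula, using only the standard library, is shown below. The name gamma_nll_sketch and the choice to return infinity for invalid parameters are assumptions of this example, not details taken from the source.

```python
import math


def gamma_nll_sketch(params, data):
    """Negative log likelihood of `data` under Gamma(shape, rate)."""
    shape, rate = params
    if shape <= 0 or rate <= 0:
        return float("inf")  # assumed guard so an optimizer avoids invalid values
    n = len(data)
    log_lik = (
        n * (shape * math.log(rate) - math.lgamma(shape))
        + (shape - 1) * sum(math.log(x) for x in data)
        - rate * sum(data)
    )
    return -log_lik


# Hypothetical data with mean 2; a rate of 0.5 fits better than a rate of 1.
data = [1.0, 2.0, 3.0]
nll_unit = gamma_nll_sketch((1.0, 1.0), data)
nll_half = gamma_nll_sketch((1.0, 0.5), data)
```

A function of this shape can be handed to a numerical minimizer to obtain maximum likelihood estimates of the shape and rate.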
- multi_stage_outlier_knn(geo_coords, gen_coords, analysis_type='composite', sig_level=0.05, maxk=50, w_power=2, min_nn_dist=1000, scale_factor=100)
Perform iterative outlier detection via KNN for genetic and geographic data.
- Parameters:
geo_coords (np.array) – Array of geographic coordinates.
gen_coords (np.array) – Array of genetic data coordinates.
analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).
sig_level (float) – Significance level for detecting outliers.
maxk (int) – Maximum number of nearest neighbors to consider.
w_power (float) – Power of distance weight in KNN prediction.
min_nn_dist (int) – Minimum distance required to consider points.
scale_factor (int) – Scaling factor for geo coordinates.
- Returns:
Indices of detected outliers for geographic and genetic data.
- Return type:
tuple
- plot_gamma_dist(sig_level, d_stats, gamma_params, dtype)
Plot the gamma distribution for detected outliers.
This method plots the gamma distribution for detected outliers based on the Dg or Dgeo statistic.
- Parameters:
sig_level (float) – Significance level for detecting outliers.
d_stats (np.array) – Dg or Dgeo statistic for each sample.
gamma_params (tuple) – Parameters for the gamma distribution.
dtype (str) – Type of data (genetic, geographic).
- predict_coords_knn(coords, knn_distances, knn_indices, w_power)
Predict coordinate data using weighted KNN, with weights based on the distances to the K-nearest neighbors.
- Parameters:
coords (np.array) – Array of genetic or geographic coordinates.
knn_distances (np.array) – Distances to K-nearest neighbors.
knn_indices (np.array) – Indices of K-nearest neighbors.
w_power (float) – Power of distance weight in prediction.
- Returns:
Predicted coordinates using weighted KNN.
- Return type:
np.array
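A weighted KNN prediction of this kind can be sketched with inverse-distance weights raised to w_power. The weighting scheme 1 / d ** w_power, the small eps floor for zero distances, and the name predict_coords_knn_sketch are assumptions for illustration; the source only states that the prediction is weighted by distance.

```python
def predict_coords_knn_sketch(coords, knn_distances, knn_indices, w_power, eps=1e-12):
    """Predict each sample as the weighted mean of its neighbours' coordinates."""
    n_dims = len(coords[0])
    predictions = []
    for dists, idxs in zip(knn_distances, knn_indices):
        # Closer neighbours get larger weights; eps avoids division by zero.
        weights = [1.0 / (max(d, eps) ** w_power) for d in dists]
        total = sum(weights)
        pred = [
            sum(w * coords[j][dim] for w, j in zip(weights, idxs)) / total
            for dim in range(n_dims)
        ]
        predictions.append(pred)
    return predictions


# Hypothetical inputs: three samples, two neighbours each, made-up distances.
coords = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
knn_indices = [[1, 2], [0, 2], [1, 0]]
knn_distances = [[1.0, 2.0], [1.0, 1.0], [1.0, 2.0]]
pred = predict_coords_knn_sketch(coords, knn_distances, knn_indices, w_power=2)
```

With w_power=2, a neighbor at distance 2 carries a quarter of the weight of a neighbor at distance 1, so the first sample is predicted at 2.4 rather than at the unweighted mean of 3.0.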
- rescale_statistic(Dgeo, s, orig_min_nn_dist, max_threshold=20)
Rescale the Dgeo array to avoid large values that might cause errors in maximum likelihood estimation.
- Parameters:
Dgeo (np.ndarray) – An array representing geographic distances or differences.
s (float) – A scalar value used in calculations.
orig_min_nn_dist (float) – Original minimum nearest neighbor distance.
max_threshold (int) – Maximum Dgeo value to trigger rescaling.
- Returns:
Adjusted scalar value and adjusted minimum nearest neighbor distance.
- Return type:
float, float
- run_multistage(geo_coords, gen_coords, sig_level, min_nn_dist, scale_factor, max_iter, optk, analysis_type, time_durations)
Run multi-stage outlier detection using KNN for genetic and geographic data.
- Parameters:
geo_coords (np.array) – Array of geographic coordinates.
gen_coords (np.array) – Array of genetic data coordinates.
sig_level (float) – Significance level for detecting outliers.
min_nn_dist (int) – Minimum distance required to consider points.
scale_factor (int) – Scaling factor for geo coordinates.
max_iter (int) – Maximum number of iterations.
optk (int) – Optimal K value for nearest neighbors.
analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).
time_durations (dict) – Dictionary to store run times for each method.
- Returns:
Time durations, outlier flags, D statistics, p-values, r-squared values, and gamma parameters.
- Return type:
tuple
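The multi-stage idea can be pictured as a loop that runs detection on the samples not yet flagged, records any new outliers, and repeats until nothing new is found or max_iter is reached. The sketch below shows that control flow with a deliberately simple stand-in detector. The stopping rule, the names run_multistage_sketch and toy_detect, and the "twice the mean" rule are all hypothetical; the documented method uses the KNN and gamma-based statistics described elsewhere on this page.

```python
def run_multistage_sketch(values, detect, max_iter):
    """Repeat detection on the retained samples until no new outliers appear."""
    outlier_flags = [False] * len(values)
    for _ in range(max_iter):
        kept = [i for i, flagged in enumerate(outlier_flags) if not flagged]
        new_flags = detect([values[i] for i in kept])
        if not any(new_flags):
            break  # converged: nothing new was flagged in this round
        for i, flagged in zip(kept, new_flags):
            if flagged:
                outlier_flags[i] = True
    return outlier_flags


def toy_detect(vals):
    """Hypothetical detector: flag values greater than twice the mean."""
    mean = sum(vals) / len(vals)
    return [v > 2 * mean for v in vals]


# The extreme value 30 inflates the mean and masks 5 in the first round.
flags = run_multistage_sketch([1.0, 1.0, 1.0, 1.0, 30.0, 5.0], toy_detect, max_iter=10)
```

In this toy run the value 30 is flagged first; once it is removed, the mean drops and the value 5 is flagged in the second round. This masking effect is the usual motivation for iterating.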
- search_nn_optk(geo_coords, gen_coords, maxk, w_power, min_nn_dist, scale_factor, analysis_type)
Search for optimal K for nearest neighbors.
This method searches for the optimal K for nearest neighbors based on the Dg or Dgeo statistic.
- Parameters:
geo_coords (np.array) – Array of geographic coordinates.
gen_coords (np.array) – Array of genetic data coordinates.
maxk (int) – Maximum number of nearest neighbors to consider.
w_power (float) – Power of distance weight in KNN prediction.
min_nn_dist (int) – Minimum distance required to consider points.
scale_factor (int) – Scaling factor for geo coordinates.
analysis_type (str) – Type of analysis to perform (genetic, geographic, composite).
- Returns:
Time durations, optimal K for genetic data, and optimal K for geographic data.
- Return type:
tuple