geogenie.plotting package

Submodules

geogenie.plotting.plotting module

class geogenie.plotting.plotting.PlotGenIE(device, output_dir, prefix, basemap_fips, basemap_highlights, url, show_plots=False, fontsize=18, filetype='png', dpi=300, remove_splines=False)

Bases: object

A class dedicated to generating and managing plots for data visualization.

This class is designed to generate a variety of plots for visualizing data, including geographical data, model training times, and model performance metrics. It provides methods for creating plots, saving them to disk, and displaying them inline.

device

The device used for plotting, typically ‘cpu’ or ‘cuda’.

Type:

str

output_dir

The directory where plots will be saved.

Type:

str

prefix

A prefix added to the names of saved plot files.

Type:

str

basemap_fips

FIPS code for base map.

Type:

str

basemap_highlights

List of counties to highlight gray on base map.

Type:

list

show_plots

Flag to determine if plots should be displayed inline. Defaults to False.

Type:

bool

fontsize

Font size used in plots. Defaults to 18.

Type:

int

filetype

File format used for saved plots. Defaults to ‘png’.

Type:

str

_fill_kde_with_gradient(xdata, cmap, norm, ax=None, ydata=None)

Fill a KDE plot with a gradient following the X-axis values.

Parameters:
  • xdata (np.ndarray) – X-axis values to plot.

  • cmap (matplotlib.colors.Colormap) – Matplotlib colormap to use.

  • norm (matplotlib.colors.Normalize) – Normalizer for color gradient.

  • ax (matplotlib.pyplot.Axes or None) – Matplotlib axis to use. If ydata is None, then ax must be provided. Defaults to None.

  • ydata (np.ndarray) – Y-axis values to plot. If None, then gets the y-axis values from the provided ax object. Defaults to None.

Raises:

TypeError – If ydata is None and ax is not provided.

_highlight_counties(gray_counties, gdf, ax=None)

Highlight specific counties in the basemap.

Parameters:
  • gray_counties (list) – List of counties to highlight.

  • gdf (geopandas.GeoDataFrame) – GeoDataFrame of the basemap.

  • ax (matplotlib.pyplot.Axes) – Matplotlib axis to use. Defaults to None.

Returns:

GeoDataFrame of the highlighted counties.

Return type:

geopandas.GeoDataFrame

Notes

  • This method highlights specific counties in the basemap by coloring them gray.

_make_colorbar(min_colorscale, max_colorscale, n_contour_levels, ax, contour)

Creates and configures a colorbar for contour plots.

This method generates a colorbar for contour plots with specified min and max values, and a defined number of levels.

Parameters:
  • min_colorscale (int) – Minimum value for the color scale.

  • max_colorscale (int) – Maximum value for the color scale.

  • n_contour_levels (int) – Number of contour levels in the plot.

  • ax (matplotlib.axes.Axes) – The matplotlib Axes object for the plot.

  • contour (matplotlib.contour.QuadContourSet) – The contour plot object.

Returns:

The colorbar object.

Return type:

matplotlib.colorbar.Colorbar

Notes

  • This method sets up a colorbar with specified min and max values, and a defined number of levels.

_plot_scatter_map(dataset, ax, coords, exp_factor, mult_factor=1.0, label='Samples', alpha=0.5, color='blue')

Plots a scatter map of coordinates, with the size of each point representing a certain attribute (e.g., sample weight).

Parameters:
  • dataset (str) – Name of the dataset being used.

  • ax (matplotlib.axes.Axes) – The matplotlib Axes object for the plot.

  • coords (np.array) – Array of coordinates to be plotted.

  • exp_factor (int) – Exponent factor for scaling the size of the markers.

  • mult_factor (float, optional) – Multiplicative factor for marker size. Defaults to 1.0.

  • label (str, optional) – Label for the plotted points. Defaults to “Samples”.

  • alpha (float, optional) – Alpha transparency for the markers. Defaults to 0.5.

  • color (str, optional) – Color of the markers. Defaults to “blue”.

Returns:

The scatter plot object.

Return type:

matplotlib.collections.PathCollection

Notes

  • This method is used for creating scatter plots on geographical maps with customizable marker sizes and colors.

_plot_smote_scatter(df, bins, ax, title, loc)

Creates a scatter plot for visualizing data points with their associated bin labels.

This method is used internally by plot_smote_bins to generate individual scatter plots.

Parameters:
  • df (pandas DataFrame) – DataFrame containing the data to be plotted.

  • bins (array-like) – Array of bin labels for the data.

  • ax (matplotlib.axes.Axes) – The matplotlib Axes object where the plot will be drawn.

  • title (str) – Title of the scatter plot.

  • loc (str) – Location of the legend in the plot.

Notes

  • This function is a helper method and is not intended to be called directly.

  • It adds a scatter plot to the provided Axes object, with data points colored by their bin labels.

_remove_spines(ax)

Remove spines from a plot.

This method removes spines from a plot; only the top and right spines are removed.

Parameters:

ax (matplotlib.axes.Axes) – The matplotlib Axes object where the plot will be drawn.

Returns:

The Axes object with spines removed.

Return type:

matplotlib.axes.Axes

_run_kriging(actual_coords, haversine_errors, xmin, ymin, xmax, ymax, buffer)

Performs Ordinary Kriging on prediction errors to estimate error and uncertainty across a geographical area.

This method uses Ordinary Kriging to interpolate prediction errors across a geographical area. It generates a grid of coordinates for interpolation and returns the interpolated error predictions and uncertainties.

Parameters:
  • actual_coords (np.array) – Array of actual geographical coordinates.

  • haversine_errors (np.array) – Array of prediction errors.

  • xmin (float) – Minimum longitude value.

  • ymin (float) – Minimum latitude value.

  • xmax (float) – Maximum longitude value.

  • ymax (float) – Maximum latitude value.

  • buffer (float) – Buffer distance for geographical plotting.

Returns:

A tuple of four arrays: grid x-coordinates, grid y-coordinates, error predictions, and error standard deviations.

Return type:

tuple of np.array

Notes

  • This method uses Ordinary Kriging to interpolate prediction errors across a geographical area.

  • It generates a grid of coordinates for interpolation and returns the interpolated error predictions and uncertainties.
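
The grid-generation step can be illustrated without any kriging library. This is a minimal sketch, assuming the bounding box is padded by buffer on every side and sampled at evenly spaced points; the helper grid_axis and the n_points argument are hypothetical and not part of geogenie. It builds only the grid axes, not the kriging model itself.

```python
def grid_axis(vmin, vmax, buffer, n_points):
    """Evenly spaced coordinates from vmin - buffer to vmax + buffer."""
    lo, hi = vmin - buffer, vmax + buffer
    step = (hi - lo) / (n_points - 1)
    return [lo + i * step for i in range(n_points)]


# Hypothetical bounding box in decimal degrees.
xmin, xmax, ymin, ymax, buffer = -94.0, -90.0, 33.0, 36.0, 0.5

gridx = grid_axis(xmin, xmax, buffer, n_points=6)  # longitudes
gridy = grid_axis(ymin, ymax, buffer, n_points=5)  # latitudes

# Every (x, y) pair on the grid is a location where an error
# prediction and its standard deviation would be interpolated.
grid_points = [(x, y) for y in gridy for x in gridx]
```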

_set_cbar_fontsize(cbar)

Sets the font size for the colorbar labels.

Parameters:

cbar (matplotlib.colorbar.Colorbar) – The colorbar object whose font size is to be set.

Returns:

The colorbar object with updated font size.

Return type:

matplotlib.colorbar.Colorbar

Notes

  • This is a utility method for adjusting the appearance of colorbars in plots.

make_optuna_plots(study)

Visualize Optuna search using built-in Optuna plotting methods.

This method generates a series of plots to visualize the results of an Optuna hyperparameter search.

Parameters:

study (optuna.study) – Optuna study to plot.

Raises:

Exception – If an error occurs during the plot generation process.

property obp

Get the output base path for the files.

property outdir

Get the output directory for the files.

property pfx

Get the prefix for the output files.

plot_bootstrap_aggregates(df)

Make KDE and box plots with bootstrap distributions and CIs.

This method creates two plots: one with kernel density estimates (KDEs) of the bootstrap distributions for each metric, and another with box plots showing the distribution of bootstrap samples for each metric.

Parameters:

df (pd.DataFrame) – The DataFrame containing the bootstrap samples for each metric.
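
A confidence interval for one metric can be read directly off its bootstrap distribution. This is a minimal sketch of a percentile interval, assuming linear interpolation between sorted samples; percentile and percentile_ci are hypothetical helpers, and geogenie may compute its CIs differently.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of pre-sorted values."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def percentile_ci(samples, level=95):
    """Two-sided percentile interval covering `level` percent of samples."""
    vals = sorted(samples)
    tail = (100 - level) / 2
    return percentile(vals, tail), percentile(vals, 100 - tail)


# Hypothetical bootstrap values of a single metric.
boot = [float(v) for v in range(101)]
lower, upper = percentile_ci(boot)
```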

plot_cumulative_error_distribution(data, fn, percentiles, median, mean)

Generate an ECDF plot for the given data.

This method generates an Empirical Cumulative Distribution Function (ECDF) plot for the given data, with gradient fill based on the data values.

Parameters:
  • data (array-like) – The dataset for which the ECDF is to be plotted. Should be a 1-D array of prediction errors.

  • fn (str) – Output filename.

  • percentiles (np.ndarray) – 25th, 50th, and 75th percentiles of errors. Will be of shape (3,).

  • median (float) – Median of prediction errors.

  • mean (float) – Mean of prediction errors.

Returns:

The ECDF plot.

Return type:

matplotlib.figure.Figure
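
For reference, an ECDF assigns to each sorted value the proportion of observations less than or equal to it. A minimal sketch with made-up error values; ecdf is a hypothetical helper and not a geogenie function.

```python
def ecdf(data):
    """Return sorted values and the cumulative proportion at each value."""
    xs = sorted(data)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


# Hypothetical 1-D prediction errors.
errors = [12.0, 3.5, 40.2, 7.1]
xs, ys = ecdf(errors)
```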

plot_data_distributions(train, val, test, is_target=False)

Plot the distributions of the train, validation, and test datasets.

Parameters:
  • train (np.ndarray) – Training dataset.

  • val (np.ndarray) – Validation dataset.

  • test (np.ndarray) – Test dataset.

  • is_target (bool, optional) – Whether the data is a target. Defaults to False.

Raises:

ValueError – If the data is not a target and the shape of the arrays is not the same.

Notes

  • This method plots the distributions of the train, validation, and test datasets.

  • If the data is a target, the method plots the distributions of the longitude and latitude values.

  • If the data is not a target, the method plots the distributions of the genotypes.

plot_error_distribution(errors, outfile)

Plot the distribution of errors using a histogram, box plot, and Q-Q plot.

Parameters:
  • errors (np.array) – An array of prediction errors.

  • outfile (str) – Output file path.

plot_gamma_distribution(shape, scale, Dg, sig_level, filename, plot_main)

Plot the gamma distribution.

This method plots the gamma distribution with the given shape and scale parameters and highlights the critical value for the given significance level.

Parameters:
  • shape (float) – Shape parameter of the gamma distribution.

  • scale (float) – Scale parameter of the gamma distribution.

  • Dg (np.array) – Dg statistic for each sample.

  • sig_level (float) – Significance level (e.g., 0.05).

  • filename (str) – Name of the file to save the plot.

  • plot_main (str) – Title of the plot.

plot_geographic_error_distribution(actual_coords, predicted_coords, dataset, buffer=0.1, marker_scale_factor=2, min_colorscale=0, max_colorscale=300, n_contour_levels=20, centroids=None)

Plots the geographic distribution of prediction errors and uncertainties.

This function calculates the Haversine error between actual and predicted coordinates and uses Gaussian Process Regression (GPR) to estimate error and uncertainty across a geographical area.

Parameters:
  • actual_coords (np.array) – Array of actual geographical coordinates.

  • predicted_coords (np.array) – Array of predicted geographical coordinates.

  • url (str) – URL for the shapefile to plot geographical data. This parameter is absent from the signature above; url is an argument of the class constructor.

  • dataset (str) – Name of the dataset being used.

  • buffer (float, optional) – Buffer distance for geographical plotting. Defaults to 0.1.

  • marker_scale_factor (int, optional) – Scale factor for marker size in plots. Defaults to 2.

  • min_colorscale (int, optional) – Minimum value for the color scale. Defaults to 0.

  • max_colorscale (int, optional) – Maximum value for the color scale. Defaults to 300.

  • n_contour_levels (int, optional) – Number of contour levels in the plot. Defaults to 20.

  • centroids (np.array or geopandas.GeoDataFrame, optional) – Array of centroids to be plotted. Defaults to None.

Notes

  • This method produces two subplots: one showing the spatial distribution of prediction errors and the other showing the uncertainty of these predictions.
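
The Haversine error is the great-circle distance between an actual and a predicted coordinate. A minimal sketch, assuming coordinates are (longitude, latitude) pairs in decimal degrees and a mean Earth radius of 6371 km; haversine_km is a hypothetical helper, not the function geogenie uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(actual, predicted):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, actual)
    lon2, lat2 = map(math.radians, predicted)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical actual and predicted sample locations.
actual_coords = [(-94.2, 36.1), (-92.3, 34.7)]
predicted_coords = [(-94.0, 36.0), (-92.3, 34.7)]
errors = [haversine_km(a, p) for a, p in zip(actual_coords, predicted_coords)]
```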

plot_history(train_loss, val_loss)

Automatically plot training and validation loss with scaling.

This method generates a plot showing the training and validation loss over epochs. It automatically scales the y-axis to ensure that the plot is visually appealing.

Parameters:
  • train_loss (list) – List of training loss values.

  • val_loss (list) – List of validation loss values.

plot_mca_curve(explained_inertia, optimal_n)

Plots the cumulative explained inertia as a function of the number of components in Multiple Correspondence Analysis (MCA).

This plot is useful for determining the optimal number of components to retain in MCA.

Parameters:
  • explained_inertia (array-like) – An array of cumulative explained inertia for each number of components.

  • optimal_n (int) – The optimal number of components determined for MCA.

Notes

  • The plot displays the explained inertia against the number of components.

  • A vertical line indicates the selected optimal number of components.

plot_nmf_error(errors, opt_n_components)

Plots the reconstruction error as a function of the number of components in Non-negative Matrix Factorization (NMF).

This plot can be used to select the optimal number of components for NMF by identifying the point where additional components do not significantly decrease the error.

Parameters:
  • errors (list) – A list of NMF reconstruction errors for each number of components.

  • opt_n_components (int) – The optimal number of components selected for NMF.

Notes

  • The plot visualizes how the reconstruction error changes with the number of NMF components.

  • A vertical line indicates the selected optimal number of NMF components.

plot_outliers(mask, y_true)

Plots a scatter plot to visualize identified outliers.

This method plots the geographical coordinates of the data points, highlighting the outliers in a different color and size.

Parameters:
  • mask (np.array) – A boolean array where ‘True’ indicates an outlier.

  • y_true (np.array) – Array of actual coordinates.

Notes

  • The function visualizes outliers on a geographical map, aiding in the identification of anomalous data points.
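
The boolean mask partitions the coordinates into two groups that can then be drawn with different colors and sizes. A minimal sketch using plain lists, with made-up coordinates; split_by_mask is a hypothetical helper. With NumPy arrays the same selection is written y_true[mask] and y_true[~mask].

```python
def split_by_mask(mask, coords):
    """Separate coordinates into (inliers, outliers) using a boolean mask."""
    outliers = [c for c, is_outlier in zip(coords, mask) if is_outlier]
    inliers = [c for c, is_outlier in zip(coords, mask) if not is_outlier]
    return inliers, outliers


# Hypothetical (lon, lat) coordinates and outlier flags.
y_true = [(-94.2, 36.1), (-92.3, 34.7), (-80.0, 25.0)]
mask = [False, False, True]
inliers, outliers = split_by_mask(mask, y_true)
```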

plot_pca_curve(x, vr, knee)

Plots the cumulative explained variance as a function of the number of principal components in Principal Component Analysis (PCA).

This plot is helpful for determining the number of components to retain in PCA.

Parameters:
  • x (array-like) – An array representing the number of principal components.

  • vr (array-like) – An array of cumulative explained variance ratios for each number of components.

  • knee (int) – The ‘knee’ point, or the optimal number of components to retain in PCA.

Notes

  • The plot shows the cumulative explained variance against the number of principal components.

  • A vertical line at the ‘knee’ point helps in visually identifying the optimal number of components.
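
The inputs to this method can be prepared along the following lines. This is a minimal sketch, assuming a simple knee heuristic: the component whose cumulative variance lies farthest above the straight line joining the first and last points of the curve. find_knee is a hypothetical helper, the variance ratios are made up, and geogenie may locate the knee with a different algorithm.

```python
from itertools import accumulate


def find_knee(x, vr):
    """Return the x value where the curve rises farthest above its chord."""
    x0, x1 = x[0], x[-1]
    y0, y1 = vr[0], vr[-1]
    slope = (y1 - y0) / (x1 - x0)
    gaps = [v - (y0 + slope * (xi - x0)) for xi, v in zip(x, vr)]
    return x[gaps.index(max(gaps))]


# Hypothetical per-component explained variance ratios.
ratios = [0.5, 0.3, 0.1, 0.05, 0.05]
vr = list(accumulate(ratios))    # cumulative explained variance
x = list(range(1, len(vr) + 1))  # number of components
knee = find_knee(x, vr)
```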

plot_sample_with_density(df, sample_id, df_known=None, dataset=None, gray_counties=None)

Method to plot a sample with density contours.

This method calculates the density contours using KDE and plots them on a map.

Parameters:
  • df (pd.DataFrame) – DataFrame containing the sample data.

  • sample_id (str) – Sample ID.

  • df_known (pd.DataFrame, optional) – DataFrame containing known data. Defaults to None.

  • dataset (str, optional) – Name of the dataset. Defaults to None.

  • gray_counties (list, optional) – List of counties to highlight in gray. Defaults to None.

Raises:

TypeError – If the dataset argument is None.
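
The density underlying such contours can be sketched with a two-dimensional Gaussian kernel density estimate. This is a minimal illustration, assuming an isotropic kernel with a fixed, hand-picked bandwidth and made-up (lon, lat) points; gaussian_kde_2d is a hypothetical helper, and the bandwidth selection used by geogenie is not documented here.

```python
import math


def gaussian_kde_2d(points, bandwidth):
    """Return a function giving the KDE density at any (x, y)."""
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(points))

    def density(x, y):
        total = 0.0
        for px, py in points:
            d2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-d2 / (2 * bandwidth ** 2))
        return norm * total

    return density


# Hypothetical cluster of location estimates for one sample.
points = [(-92.1, 35.0), (-92.0, 35.1), (-91.9, 34.9)]
density = gaussian_kde_2d(points, bandwidth=0.5)

near = density(-92.0, 35.0)  # inside the cluster
far = density(-90.0, 33.0)   # well outside the cluster
```

Evaluating density over a grid of longitudes and latitudes gives the surface from which contour levels are drawn.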

plot_scatter_samples_map(y_true_train, y_true, dataset, hue1=None, hue2=None)

Plots geographical scatter plots of training and test/validation sample densities.

This method creates a subplot with two scatter plots, one showing the density of training samples and the other for test or validation samples.

Parameters:
  • y_true_train (np.array) – Array of actual geographical coordinates for the training dataset.

  • y_true (np.array) – Array of actual geographical coordinates for the test or validation dataset.

  • dataset (str) – Specifies whether the dataset is ‘test’ or ‘validation’.

  • hue1 (np.array, optional) – Array of hue values for the training dataset. Defaults to None.

  • hue2 (np.array, optional) – Array of hue values for the test or validation dataset. Defaults to None.

Notes

  • The method visualizes the geographical distribution of training and test/validation samples.

  • It uses scatter plots to represent the density of samples in different geographical areas.

  • The scatter plots are overlaid on top of a base map obtained from the specified shapefile URL.

plot_smote_bins(df, bins, df_orig, bins_orig)

Plots scatter plots before and after oversampling.

This method visualizes the effect of oversampling on the data distribution. It creates a subplot with two scatter plots: one showing the original data and the other showing the data after SMOTE has been applied.

Parameters:
  • df (pandas DataFrame) – DataFrame containing the data after SMOTE oversampling.

  • bins (array-like) – Array of bin labels for the data after SMOTE.

  • df_orig (pandas DataFrame) – DataFrame containing the original data before SMOTE.

  • bins_orig (array-like) – Array of original bin labels before SMOTE.

Notes

  • This function visually compares the geographical distribution of data before and after SMOTE.

  • Each plot shows data points colored by their bin labels, providing insight into the oversampling process.

plot_times(rolling_avgs, rolling_stds, filename)

Plot model training times.

This method generates a plot showing the rolling average of model training times over bootstrap replicates. It also includes the standard deviation of training times to provide insight into the variability of training durations.

Parameters:
  • rolling_avgs (list) – List of rolling average training times.

  • rolling_stds (list) – List of rolling standard deviations of training times.

  • filename (str) – Name of the file to save the plot.

Notes

  • This method visualizes the time taken to train models over bootstrap replicates.

  • It uses a line plot to show the rolling average of training times and includes a shaded region to represent the standard deviation of training times.
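
The two list arguments could be derived from raw per-replicate training times as follows. This is a minimal sketch, assuming a fixed trailing window and the population standard deviation; rolling_stats and the window size are hypothetical, and the times are made up.

```python
import statistics


def rolling_stats(times, window):
    """Rolling mean and population std over a trailing window."""
    avgs, stds = [], []
    for end in range(window, len(times) + 1):
        chunk = times[end - window:end]
        avgs.append(statistics.fmean(chunk))
        stds.append(statistics.pstdev(chunk))
    return avgs, stds


# Hypothetical training times (seconds) per bootstrap replicate.
times = [10.0, 12.0, 14.0, 16.0]
rolling_avgs, rolling_stds = rolling_stats(times, window=2)

# Bounds of the shaded region: mean minus and plus one standard deviation.
lower = [a - s for a, s in zip(rolling_avgs, rolling_stds)]
upper = [a + s for a, s in zip(rolling_avgs, rolling_stds)]
```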

plot_zscores(z, fn)

Plot Z-score histogram for prediction errors.

This method plots a histogram of Z-scores for prediction errors, with a gradient fill based on the Z-score values.

Parameters:
  • z (np.ndarray) – Array of Z-scores.

  • errors (np.ndarray) – Array of prediction errors. This parameter is absent from the signature above, which accepts only z and fn.

  • fn (str) – Filename for the output plot.
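
A Z-score expresses each prediction error as its distance from the mean in units of standard deviation. A minimal sketch, assuming the population standard deviation is used (geogenie may compute it differently); z_scores is a hypothetical helper and the error values are made up.

```python
import statistics


def z_scores(errors):
    """Standardize errors to zero mean and unit (population) std."""
    mean = statistics.fmean(errors)
    std = statistics.pstdev(errors)
    return [(e - mean) / std for e in errors]


# Hypothetical prediction errors with mean 5 and population std 2.
z = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```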

polynomial_regression_plot(actual_coords, predicted_coords, dataset, degree=3, dtype=torch.float32, max_ylim=None, max_xlim=None, n_xticks=5)

Creates a polynomial regression plot with the specified degree.

Parameters:
  • actual_coords (np.array) – Array of actual geographical coordinates.

  • predicted_coords (np.array) – Array of predicted geographical coordinates by the model.

  • dataset (str) – Specifies the dataset being used, should be either ‘test’ or ‘validation’.

  • degree (int) – Polynomial degree to fit. Defaults to 3.

  • dtype (torch.dtype) – PyTorch data type to use. Defaults to torch.float32.

  • max_ylim (int) – Maximum y-axis (prediction error) value to plot. Defaults to None (don’t adjust y-axis limits).

  • max_xlim (float) – Maximum X-axis (sample density) value to plot. Defaults to None (don’t adjust x-axis limits).

  • n_xticks (int) – Number of major X-axis ticks to use. Only applied if max_xlim is not None. Defaults to 5.

Raises:

ValueError – If the dataset parameter is not ‘test’ or does not start with ‘val’.

Notes

  • This function calculates the Haversine error for each pair of actual and predicted coordinates.

  • It then computes the KDE values for these errors and plots a regression to analyze the relationship.

update_config_labels(df)

Update config labels in the dataframe based on the file patterns.

This method updates the configuration labels in the dataframe based on the file patterns used to generate the data.

Parameters:

df (pd.DataFrame) – The dataframe to be updated.

Returns:

The updated dataframe.

Return type:

pd.DataFrame

update_metric_labels(df)

Update metric labels in the dataframe based on specified mappings.

This method updates the metric labels in the dataframe based on the mappings provided in the method.

Parameters:

df (pd.DataFrame) – The dataframe to be updated.

Returns:

The updated dataframe.

Return type:

pd.DataFrame

visualize_oversample_clusters(arr, cluster_labels, sample_origin_list)

Visualize the genotypes and their clusters in a 2D scatter plot.

This method plots the genotypes and their clusters in a 2D scatter plot, coloring the points by cluster and shape by sample origin.

Parameters:
  • arr (np.ndarray) – Array to use for clustering.

  • cluster_labels (np.ndarray) – Cluster labels to use.

  • sample_origin_list (list) – List of sample origins (synthetic versus original).

Module contents