starlingrt package#

Submodules#

starlingrt.data module#

Structures to organize raw data into.

Author: Nathan A. Mahynski

class starlingrt.data.Compound(number: int, rt: float, scan_number: int, area: int, baseline_height: int, absolute_height: int, peak_width: float)[source]#

Bases: object

A compound is a peak in the GCMS output that has been detected and must be assigned to one or more library Hits.

Each peak (Compound) in the MSRep.xls file > LibRes tab is assigned various Hits.

absolute_height: ClassVar[int]#

area: ClassVar[int]#

baseline_height: ClassVar[int]#

get_params() → dict[str, Any][source]#: Get parameters.

number: ClassVar[int]#

peak_width: ClassVar[float]#

rt: ClassVar[float]#

scan_number: ClassVar[int]#

set_params(**parameters: Any) → Compound[source]#: Set parameters.
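The get_params / set_params pair above follows a common getter/setter convention in which set_params returns the object itself. The sketch below illustrates that convention on a hypothetical stand-in class named Peak; it is not the starlingrt implementation, and the two attributes shown are only a subset chosen for brevity.

```python
from typing import Any


class Peak:
    """Hypothetical, reduced stand-in for a parameter-holding class such as Compound."""

    def __init__(self, number: int, rt: float) -> None:
        self.number = number
        self.rt = rt

    def get_params(self) -> dict[str, Any]:
        # Return the stored parameters as a plain dictionary.
        return {"number": self.number, "rt": self.rt}

    def set_params(self, **parameters: Any) -> "Peak":
        # Assign each keyword to the attribute of the same name, then
        # return self so that calls can be chained.
        for name, value in parameters.items():
            setattr(self, name, value)
        return self


peak = Peak(number=1, rt=5.02).set_params(rt=5.04)
params = peak.get_params()
```

Returning the instance from set_params is what allows construction and modification to be written as a single expression.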

class starlingrt.data.Entry(sample_filename: str, compound_number: int, rt: int, scan_number: int, area: int, baseline_height: int, absolute_height: int, peak_width: float, hit_number: int, hit_name: str, quality: int, mol_weight: float, cas_number: str, library: str, entry_number_library: int)[source]#

Bases: object

Create an Entry.

This is essentially a combination of Hit and Compound intended to “unroll” their information into a flat data structure more amenable for searching.

absolute_height: ClassVar[int]#

area: ClassVar[int]#

baseline_height: ClassVar[int]#

cas_number: ClassVar[str]#

compound_number: ClassVar[int]#

entry_number_library: ClassVar[int]#

get_params() → dict[str, Any][source]#: Get parameters.

hit_name: ClassVar[str]#

hit_number: ClassVar[int]#

library: ClassVar[str]#

mol_weight: ClassVar[float]#

peak_width: ClassVar[float]#

quality: ClassVar[int]#

rt: ClassVar[int]#

sample_filename: ClassVar[str]#

scan_number: ClassVar[int]#

set_params(**parameters: Any) → Entry[source]#: Set parameters.
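To illustrate what "unrolling" a Compound and its Hits into flat Entry records might look like, here is a minimal sketch. PeakInfo, HitInfo, and unroll are hypothetical names with a reduced set of fields; the real classes carry more attributes and may be combined differently.

```python
from dataclasses import dataclass


@dataclass
class PeakInfo:
    """Hypothetical, reduced stand-in for Compound."""
    number: int
    rt: float


@dataclass
class HitInfo:
    """Hypothetical, reduced stand-in for Hit."""
    number: int
    name: str
    quality: int


def unroll(sample_filename: str, peak: PeakInfo, hits: list[HitInfo]) -> list[dict]:
    """Produce one flat record per (peak, hit) pair."""
    rows = []
    for hit in hits:
        rows.append(
            {
                "sample_filename": sample_filename,
                "compound_number": peak.number,
                "rt": peak.rt,
                "hit_number": hit.number,
                "hit_name": hit.name,
                "quality": hit.quality,
            }
        )
    return rows


rows = unroll(
    "sample_01/MSRep.xls",  # hypothetical path
    PeakInfo(number=1, rt=5.02),
    [HitInfo(1, "toluene", 93), HitInfo(2, "benzene", 40)],
)
```

Each record repeats the peak-level fields, which costs some redundancy but lets every row be searched or filtered without following references back to its parent peak.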

class starlingrt.data.Hit(number: int, name: str, quality: int, mol_weight: float, cas_number: str, library: str, entry_number_library: int)[source]#

Bases: object

A possible assignment to a peak from the library in use.

Each peak (Compound) in the MSRep.xls file > LibRes tab is assigned various Hits.

cas_number: ClassVar[str]#

entry_number_library: ClassVar[int]#

get_params() → dict[str, Any][source]#: Get parameters.

library: ClassVar[str]#

mol_weight: ClassVar[float]#

name: ClassVar[str]#

number: ClassVar[int]#

quality: ClassVar[int]#

set_params(**parameters: Any) → Hit[source]#: Set parameters.

class starlingrt.data.Utilities[source]#

Bases: object

Utility functions for manipulating data structures.

static create_entries(samples: list) → dict[str, starlingrt.data.Entry][source]#

Extract all Entry from samples.

Parameters:: samples (list(_SampleBase)) – List of Samples collected from all directories in input_directory.
Returns:: total_entries – Dictionary of all Entry in samples whose keys are sha1 hashes and values are Entry objects.
Return type:: dict(str, Entry)
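The returned dictionary is keyed by sha1 hashes. The documentation does not say which fields are hashed, so the sketch below is only an assumption: it builds a key from a hypothetical choice of identifying fields using the standard hashlib module.

```python
import hashlib


def entry_key(sample_filename: str, compound_number: int, hit_number: int) -> str:
    """Build a sha1 hex digest from fields assumed to identify an entry."""
    # The choice of fields and the "|" separator are assumptions made for
    # illustration, not the scheme used by starlingrt.
    text = f"{sample_filename}|{compound_number}|{hit_number}"
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


key_a = entry_key("sample_01/MSRep.xls", 1, 1)
key_b = entry_key("sample_01/MSRep.xls", 1, 2)
```

A sha1 hex digest is always 40 characters long and is deterministic, so the same fields map to the same key on every run.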

static group_entries_by_name(entries: dict[str, starlingrt.data.Entry]) → dict[str, list[tuple[starlingrt.data.Entry, str]]][source]#

Group entries with the same hit name.

Parameters:: entries (dict(str, Entry)) – Dictionary of Entry whose keys are sha1 hashes and values are Entry objects.
Returns:: groups – Dictionary whose keys are hit names and values are lists of (Entry, hash) tuples.
Return type:: dict(str, list(tuple(Entry, str)))
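The grouping described above can be sketched with a defaultdict. Plain dictionaries stand in for Entry objects here, and group_by_name is a hypothetical helper, not the library function.

```python
from collections import defaultdict


def group_by_name(entries: dict[str, dict]) -> dict[str, list[tuple[dict, str]]]:
    """Group (entry, hash) pairs under the entry's hit name."""
    groups = defaultdict(list)
    for key, entry in entries.items():
        groups[entry["hit_name"]].append((entry, key))
    return dict(groups)


# Keys are placeholder strings standing in for sha1 hashes.
entries = {
    "hash-a": {"hit_name": "toluene", "rt": 5.02},
    "hash-b": {"hit_name": "benzene", "rt": 3.10},
    "hash-c": {"hit_name": "toluene", "rt": 5.03},
}
groups = group_by_name(entries)
```

Keeping the hash next to each entry means a grouped item can always be traced back to its key in the original dictionary.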

static group_entries_by_rt(entries: dict[str, starlingrt.data.Entry]) → dict[float, list[starlingrt.data.Entry]][source]#

Group entries with the same retention time.

Parameters:: entries (dict(str, Entry)) – Dictionary of Entry whose keys are sha1 hashes and values are Entry objects.
Returns:: groups – Dictionary whose keys are retention times and values are lists of Entry objects.
Return type:: dict(float, list(Entry))

static select_top_entries(total_entries: dict[str, starlingrt.data.Entry]) → dict[str, starlingrt.data.Entry][source]#

Trim down the entries to just have the top (quality) hits (i.e., hit_number == 1).

Parameters:: total_entries (dict(str, Entry)) – Dictionary of all Entry in samples whose keys are sha1 hashes.
Returns:: top_entries – Dictionary of all Entry with hit_number == 1 whose keys are sha1 hashes and values are Entry objects.
Return type:: dict(str, Entry)
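The filter on hit_number == 1 amounts to a dictionary comprehension. In this sketch plain dictionaries stand in for Entry objects and select_top is a hypothetical helper.

```python
def select_top(total_entries: dict[str, dict]) -> dict[str, dict]:
    """Keep only the entries whose hit_number equals 1."""
    return {
        key: entry
        for key, entry in total_entries.items()
        if entry["hit_number"] == 1
    }


# Keys are placeholder strings standing in for sha1 hashes.
total_entries = {
    "hash-a": {"hit_name": "toluene", "hit_number": 1},
    "hash-b": {"hit_name": "benzene", "hit_number": 2},
    "hash-c": {"hit_name": "o-xylene", "hit_number": 1},
}
top_entries = select_top(total_entries)
```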

starlingrt.functions module#

Functions to manipulate and compute properties of GCMS data.

Author: Nathan A. Mahynski

starlingrt.functions.assign_suggestions(df, rt_groups, suggested_name, ties)[source]#

Unroll the suggestions and assign them to the dataframe.

Parameters:

df (pd.DataFrame) – DataFrame of retention times (see get_dataframe).
rt_groups (list(tuple)) – List of tuples of (index, hit_name, quality, rt) by group (see group_by_rt_step).
suggested_name (list) – Suggested name of each group (see suggest_names).
ties (dict) – Dictionary of ties (see suggest_names).

Returns:

dataframe – DataFrame with “suggested_name” and “flag” columns added.

Return type:

pd.DataFrame

starlingrt.functions.closest_rt(rt: float, df_iqr: DataFrame) → str[source]#

Suggest the (visible) species with the closest median retention time.

Parameters:

rt (float) – Retention time targeted.
df_iqr (pd.DataFrame) – DataFrame summarizing the IQR and whisker bounds for the name_groups.

Returns:

guess – Name of the Hit with the closest retention time.

Return type:

str
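A nearest-median lookup of this kind can be sketched as follows. The dictionary of medians stands in for the relevant column of df_iqr, and closest_by_median is a hypothetical helper that ignores the "visible" filtering mentioned above.

```python
def closest_by_median(rt: float, medians: dict[str, float]) -> str:
    """Return the name whose median retention time is nearest to rt."""
    return min(medians, key=lambda name: abs(medians[name] - rt))


# Illustrative median retention times, not measured values.
medians = {"benzene": 3.10, "toluene": 5.02, "o-xylene": 7.45}
guess = closest_by_median(5.20, medians)
```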

starlingrt.functions.estimate_threshold(df: DataFrame, thresholds: ndarray[Any, dtype[_ScalarType_co]] = np.logspace(-4, 1, 100), display: bool = False) → float[source]#

Estimate the minimum gap (threshold) to separate groups from each other.

Parameters:

df (pd.DataFrame) – DataFrame of retention times for selected target and any selected neighbors (see get_dataframe).
thresholds (ndarray(float, ndim=1), optional(default=np.logspace(-4, 1, 100))) – Sequence of thresholds to try.
display (bool, optional(default=False)) – Whether or not to display the results visually.

Returns:

threshold – Choice of threshold.

Return type:

float
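The default thresholds grid is documented as np.logspace(-4, 1, 100), that is, 100 values spaced evenly on a base-10 logarithmic scale from 1e-4 to 10. The sketch below reproduces that grid with the standard library only; logspace here is a local helper written for illustration, not the NumPy function.

```python
def logspace(start_exp: float, stop_exp: float, num: int) -> list[float]:
    """Return num values spaced evenly in log10 between 10**start_exp and 10**stop_exp."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10.0 ** (start_exp + i * step) for i in range(num)]


thresholds = logspace(-4, 1, 100)
# Consecutive thresholds differ by a constant factor of 10 ** (5 / 99).
ratio = thresholds[1] / thresholds[0]
```

Because the grid is geometric, it samples small gaps much more densely than large ones, which suits a search for a minimum separating gap.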

starlingrt.functions.flag_entry_rt(entries: list[tuple['starlingrt.data.Entry', str]], min_entries: int = 10, k: float = 3.0, cv: int = 5, style: str = 'classical') → ndarray[Any, dtype[bool_]][source]#

Flag entries with anomalous retention times based on the group’s consensus.

If a point is considered an outlier in any single fold, it is flagged.

Parameters:

entries (list(tuple(Entry, str))) – List of entries (e.g., grouped by name) to examine.
min_entries (int, optional(default=10)) – Minimum length of entries needed for KFold cross-validation (CV); with fewer entries, cv is ignored and LOOCV is used instead.
k (float, optional(default=3.0)) – Number of standard deviations from center allowed. k = 1.5 is more appropriate if using the robust approach based on the IQR as a measure of “spread”; k = 3 is more appropriate for classical, but can be reasonable for the robust approach as well.
cv (int, optional(default=5)) – Number of folds to use in cross-validation. If len(entries) > cv, LOOCV is used instead.
style (str, optional(default="classical")) – Use "classical" to take the mean and standard deviation, or "robust" to take the median and IQR, as the center and spread, respectively. Inliers are considered those for which: center - k*spread < x < center + k*spread. The rest are flagged.

Returns:

mask – Mask of outliers corresponding to the ordering in entries.

Return type:

ndarray(bool)
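The inlier rule center - k*spread < x < center + k*spread can be sketched for both styles with the standard statistics module. This is a simplified illustration that applies the rule once to the whole group; it omits the cross-validation folds, and the use of the sample standard deviation and of the default quartile method are assumptions.

```python
import statistics


def flag_outliers(values: list[float], k: float = 3.0, style: str = "classical") -> list[bool]:
    """Return True for each value outside center +/- k * spread."""
    if style == "classical":
        center = statistics.mean(values)
        spread = statistics.stdev(values)  # sample standard deviation (assumed)
    elif style == "robust":
        center = statistics.median(values)
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1  # interquartile range
    else:
        raise ValueError(f"Unknown style: {style!r}")
    lower, upper = center - k * spread, center + k * spread
    return [not (lower < x < upper) for x in values]


# Illustrative retention times with one obviously late peak.
rts = [5.00, 5.01, 5.02, 5.01, 5.00, 5.02, 5.01, 6.50]
robust_mask = flag_outliers(rts, k=1.5, style="robust")
classical_mask = flag_outliers(rts, k=3.0, style="classical")
```

On this small group the robust rule flags the last point while the classical rule flags nothing, because the outlier itself inflates the mean and standard deviation. Estimating center and spread on folds that may exclude the suspect point is one way to reduce that effect.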

starlingrt.functions.get_dataframe(entries: dict, target: str | None = None, pm: int = 0) → tuple['pd.DataFrame', 'pd.api.typing.DataFrameGroupBy', list][source]#

Get dataframe centered on a target.

Parameters:

entries (dict(str, Entry)) – Dictionary of Entry whose keys are sha1 hashes and values are Entry objects.
target (str, optional(default=None)) – Name of retention time group to target.
pm (int, optional(default=0)) – Number of neighbors around target to select.

Returns:

results (pd.DataFrame) – DataFrame of retention times for selected target and any selected neighbors.
name_groups (pd.api.typing.DataFrameGroupBy) – Pandas GroupBy object containing the entries grouped by Hit name.
order_cats_used (list) – Ordered list of names of the groups selected.

starlingrt.functions.get_quantiles_df(name_groups: DataFrameGroupBy) → DataFrame[source]#

Get the 0.25, 0.50, and 0.75 quantiles (25th, 50th, and 75th percentiles) of the groups.

Parameters:: name_groups (pd.api.typing.DataFrameGroupBy) – Pandas GroupBy object containing the entries grouped by Hit name.
Returns:: dataframe – DataFrame summarizing the IQR and whisker bounds for the name_groups.
Return type:: pd.DataFrame
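For a single group, the quartiles, IQR, and whisker bounds might be computed as in the sketch below. It assumes the common Tukey convention of placing whiskers 1.5 × IQR beyond the quartiles; the documentation does not state which convention starlingrt uses, and summarize is a hypothetical helper.

```python
import statistics


def summarize(rts: list[float]) -> dict[str, float]:
    """Compute quartiles, IQR, and Tukey-style whisker bounds for one group."""
    # method="inclusive" interpolates linearly between the data points.
    q1, median, q3 = statistics.quantiles(rts, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": q1 - 1.5 * iqr,  # assumed convention
        "upper_whisker": q3 + 1.5 * iqr,  # assumed convention
    }


summary = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
```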

starlingrt.functions.group_by_rt_step(df: DataFrame, threshold: float = 0.04) → list[list[tuple[Any, str, float, float]]][source]#

Create groups based on similar retention times.

This algorithm simply sorts based on retention time, then creates a new group when a gap between consecutive (sorted) points exceeds the threshold.

Parameters:

df (pd.DataFrame) – DataFrame of retention times (see get_dataframe).
threshold (float, optional(default=0.04)) – Minimum retention time gap between consecutive compounds to be resolved as different.

Returns:

rt_groups – List of tuples of (index, hit_name, quality, rt) by group.

Return type:

list(list(tuples))
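The gap-based grouping described above (sort by retention time, start a new group whenever the gap to the previous point exceeds the threshold) can be sketched on a plain list of retention times. group_by_gap is a hypothetical helper that works on floats rather than on the DataFrame rows the library function uses.

```python
def group_by_gap(rts: list[float], threshold: float = 0.04) -> list[list[float]]:
    """Split sorted retention times wherever consecutive values differ by more than threshold."""
    groups: list[list[float]] = []
    for rt in sorted(rts):
        if groups and rt - groups[-1][-1] <= threshold:
            # Gap to the previous point is small: same group.
            groups[-1].append(rt)
        else:
            # First point, or gap exceeds the threshold: new group.
            groups.append([rt])
    return groups


rt_groups = group_by_gap([5.10, 5.00, 5.02, 5.30, 5.31], threshold=0.04)
```

Because only consecutive gaps are compared, a long chain of closely spaced points ends up in one group even if its first and last members are far apart.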

starlingrt.functions.make_dataframe(entries: dict) → DataFrame[source]#

Create a dataframe out of the entries.

Parameters:: entries (dict(str, Entry)) – Dictionary of Entry whose keys are sha1 hashes and values are Entry objects.
Returns:: dataframe – DataFrame of Entries sorted by Hit name.
Return type:: pd.DataFrame

starlingrt.functions.make_histograms(by_name: dict[str, list[tuple['starlingrt.data.Entry', str]]], k_values: ndarray[Any, dtype[floating]], bins: int = 10, cv: int = 3, style: str = 'robust', min_entries: int = 5) → tuple[dict, dict, dict][source]#

Make histograms of retention times for each compound by name.

Parameters:

by_name (dict(str, list(tuple(Entry, str)))) – Dictionary of Entry whose keys are hit names and values are tuples of (Entry, hash).
k_values (array-like) – List of k values to use to flag “outlying” retention times.
bins (int) – Number of histogram bins to use for retention times.
cv (int, optional(default=3)) – Number of folds to use in cross-validation.
style (str, optional(default="robust")) – Use "classical" to take the mean and standard deviation, or "robust" to take the median and IQR, as the center and spread, respectively. Inliers are considered those for which: center - k*spread < x < center + k*spread. The rest are flagged.
min_entries (int, optional(default=5)) – Minimum length of entries needed for KFold cross-validation (CV); with fewer entries, cv is ignored and LOOCV is used instead.

Returns:

histograms (dict(str, list)) – Values of the histogram for each compound.
bin_edges (dict(str, list)) – Bin edges for the histogram of each compound.
points (dict(str, dict(str, dict(str, dict(str, list))))) – Nested dictionary of points that are or are not of concern for each k value; e.g., points['methane']['3.0']['concern'] = {'x': retention_times, 'y': staggered_bin_counts}.

starlingrt.functions.suggest_names(rt_groups: list[list[tuple[int, str, float, float]]]) → tuple[list, dict, list][source]#

Suggest the best name for each group of compounds with similar retention times.

Computes a “probability” using the quality of each observation in a group to determine the most likely name.

Parameters:

rt_groups (list(tuple)) – List of tuples of (index, hit_name, quality, rt) by group (see group_by_rt_step).

Returns:

suggested_name (list) – Suggested name of each group.
ties (dict) – Dictionary of ties.
entropy (list) – Entropy of each group.
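One way to turn match qualities into a "probability" per name is to sum the quality for each name in a group and normalize. The sketch below does that and reports ties and the Shannon entropy of the result; the exact weighting and tie handling in starlingrt are not documented here, so suggest is a hypothetical helper built on those assumptions.

```python
import math
from collections import defaultdict


def suggest(group: list[tuple[int, str, float, float]]) -> tuple[str, list[str], float]:
    """Return (suggested name, tied alternatives, entropy) for one retention time group."""
    totals: dict[str, float] = defaultdict(float)
    for _index, hit_name, quality, _rt in group:
        totals[hit_name] += quality  # quality-weighted vote (assumed)
    norm = sum(totals.values())
    if norm <= 0:
        raise ValueError("Total quality must be positive.")
    probs = {name: total / norm for name, total in totals.items()}
    best = max(probs.values())
    # Names sharing the top probability; sorted for a deterministic choice.
    winners = sorted(name for name, p in probs.items() if p == best)
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return winners[0], winners[1:], entropy


group = [
    (0, "toluene", 90.0, 5.00),
    (1, "toluene", 80.0, 5.01),
    (2, "benzene", 30.0, 5.02),
]
name, tied, entropy = suggest(group)
```

An entropy near zero indicates that nearly all of the quality in the group supports one name, while larger values indicate disagreement among the hits.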

starlingrt.sample module#

Structures to store samples from different mass spectrometers.

Author: Nathan A. Mahynski

class starlingrt.sample.MassHunterSample(filename: str)[source]#

Bases: _SampleBase

Class to store the MSRep.xls output from MassHunter(TM).

read(filename: str) → None[source]#

Read data from MSRep.xls file.

This assumes a specific formatted output from MassHunter(TM) which is checked below.

Parameters:: filename (str) – Pathname of MSRep.xls file.

starlingrt.visualize module#

Visualize GCMS data to determine consensus and any corrections necessary.

Author: Nathan A. Mahynski

starlingrt.visualize.make(top_entries: dict[str, starlingrt.data.Entry], width: int, threshold: float, output_filename: str = 'summary.html') → None[source]#

Make the interactive HTML document for users to inspect.

Parameters:

top_entries (dict(str, Entry)) – Top entries (hit number = 1) labeled by their sha1 hash.
width (int) – Width of the HTML table output.
threshold (float) – Minimum retention time gap between consecutive compounds to be resolved as different.
output_filename (str, optional(default="summary.html")) – Name of HTML file to save results to.
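Putting the documented pieces together, a report might be produced along the lines of the sketch below. It uses only names documented on this page (MassHunterSample, Utilities.create_entries, Utilities.select_top_entries, and visualize.make), but the function build_summary, the width value, and the threshold value are illustrative assumptions, and it assumes that constructing a MassHunterSample reads the given file.

```python
def build_summary(msrep_paths: list[str], output_filename: str = "summary.html") -> None:
    """Hypothetical end-to-end helper: read samples, keep top hits, write the HTML report."""
    # Imports are deferred so that defining this function does not
    # require starlingrt to be installed.
    from starlingrt import data, sample, visualize

    # One sample per MSRep.xls file (assumes the constructor reads the file).
    samples = [sample.MassHunterSample(path) for path in msrep_paths]
    total_entries = data.Utilities.create_entries(samples)
    top_entries = data.Utilities.select_top_entries(total_entries)
    visualize.make(
        top_entries,
        width=1200,      # illustrative table width
        threshold=0.04,  # illustrative gap; matches the group_by_rt_step default
        output_filename=output_filename,
    )
```

Calling build_summary with a list of MSRep.xls paths would then write the interactive document to output_filename.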

Module contents#

Starlingrt module.

Author: Nathan A. Mahynski
