starlingrt package
Submodules
starlingrt.data module
Structures to organize raw data into.
Author: Nathan A. Mahynski
- class starlingrt.data.Compound(number: int, rt: float, scan_number: int, area: int, baseline_height: int, absolute_height: int, peak_width: float)
Bases: object
A compound is a peak in the GCMS output that has been detected and must be assigned to one or more library Hits.
Each peak (Compound) in the MSRep.xls file > LibRes tab is assigned various Hits.
- absolute_height: ClassVar[int]
- area: ClassVar[int]
- baseline_height: ClassVar[int]
- number: ClassVar[int]
- peak_width: ClassVar[float]
- rt: ClassVar[float]
- scan_number: ClassVar[int]
- class starlingrt.data.Entry(sample_filename: str, compound_number: int, rt: int, scan_number: int, area: int, baseline_height: int, absolute_height: int, peak_width: float, hit_number: int, hit_name: str, quality: int, mol_weight: float, cas_number: str, library: str, entry_number_library: int)
Bases: object
Create an Entry.
This is essentially a combination of Hit and Compound intended to “unroll” their information into a flat data structure more amenable for searching.
- absolute_height: ClassVar[int]
- area: ClassVar[int]
- baseline_height: ClassVar[int]
- cas_number: ClassVar[str]
- compound_number: ClassVar[int]
- entry_number_library: ClassVar[int]
- hit_name: ClassVar[str]
- hit_number: ClassVar[int]
- library: ClassVar[str]
- mol_weight: ClassVar[float]
- peak_width: ClassVar[float]
- quality: ClassVar[int]
- rt: ClassVar[int]
- sample_filename: ClassVar[str]
- scan_number: ClassVar[int]
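As an illustration of the "unrolling" described for Entry, the sketch below flattens one peak and its library hits into one flat record per (peak, hit) pair. `PeakSketch`, `HitSketch`, and `unroll` are hypothetical stand-ins written for this example, not starlingrt classes or functions; only the field names are taken from the Compound, Hit, and Entry signatures on this page, and the values are made up.

```python
from dataclasses import asdict, dataclass


@dataclass
class PeakSketch:
    """Hypothetical stand-in carrying the Compound fields."""
    number: int
    rt: float
    scan_number: int
    area: int
    baseline_height: int
    absolute_height: int
    peak_width: float


@dataclass
class HitSketch:
    """Hypothetical stand-in carrying the Hit fields."""
    number: int
    name: str
    quality: int
    mol_weight: float
    cas_number: str
    library: str
    entry_number_library: int


def unroll(sample_filename, peak, hits):
    """Return one flat dict per (peak, hit) pair, using the Entry field names."""
    rows = []
    for hit in hits:
        row = {"sample_filename": sample_filename}
        # Peak fields: only `number` is renamed; the rest keep their names.
        peak_fields = asdict(peak)
        row["compound_number"] = peak_fields.pop("number")
        row.update(peak_fields)
        # Hit fields: `number` and `name` gain a `hit_` prefix.
        hit_fields = asdict(hit)
        row["hit_number"] = hit_fields.pop("number")
        row["hit_name"] = hit_fields.pop("name")
        row.update(hit_fields)
        rows.append(row)
    return rows


# Made-up peak and hits, for illustration only.
peak = PeakSketch(1, 5.01, 120, 1000, 50, 60, 0.05)
hits = [
    HitSketch(1, "toluene", 95, 92.14, "108-88-3", "example.L", 10),
    HitSketch(2, "benzene", 60, 78.11, "71-43-2", "example.L", 11),
]
rows = unroll("sample_a", peak, hits)
```

Each row repeats the peak's fields, so a search over the flat rows needs no nested lookups.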
- class starlingrt.data.Hit(number: int, name: str, quality: int, mol_weight: float, cas_number: str, library: str, entry_number_library: int)
Bases: object
A possible assignment to a peak from the library in use.
Each peak (Compound) in the MSRep.xls file > LibRes tab is assigned various Hits.
- cas_number: ClassVar[str]
- entry_number_library: ClassVar[int]
- library: ClassVar[str]
- mol_weight: ClassVar[float]
- name: ClassVar[str]
- number: ClassVar[int]
- quality: ClassVar[int]
- class starlingrt.data.Utilities
Bases: object
Utility functions for manipulating data structures.
- static create_entries(samples: list) -> dict[str, starlingrt.data.Entry]
Extract all Entry from samples.
- Parameters:
samples (list(_SampleBase)) – List of Samples collected from all directories in input_directory.
- Returns:
total_entries – Dictionary of all Entry in samples whose keys are sha1 hashes and values are Entry objects.
- Return type:
dict(str, Entry)
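The returned dictionary is keyed by sha1 hashes. This page does not say which fields are hashed, so the sketch below simply assumes the hash is taken over the text representation of a record; `sha1_key` and the record layout are hypothetical and shown only to illustrate the idea of hash-keyed entries.

```python
import hashlib


def sha1_key(record: tuple) -> str:
    """Hash a record's fields into a 40-character hex digest (assumed scheme)."""
    return hashlib.sha1(repr(record).encode("utf-8")).hexdigest()


# Made-up records: (sample_filename, compound_number, hit_number, hit_name).
records = [
    ("sample_a", 1, 1, "toluene"),
    ("sample_a", 1, 2, "benzene"),
    ("sample_b", 1, 1, "toluene"),
]

# Identical records map to the same key; distinct records get distinct keys.
total_entries = {sha1_key(record): record for record in records}
```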
- static group_entries_by_name(entries: dict[str, starlingrt.data.Entry]) -> dict[str, list[tuple[starlingrt.data.Entry, str]]]
Group entries with the same hit name.
- static group_entries_by_rt(entries: dict[str, starlingrt.data.Entry]) -> dict[float, list[starlingrt.data.Entry]]
Group entries with the same retention time.
- static select_top_entries(total_entries: dict[str, starlingrt.data.Entry]) -> dict[str, starlingrt.data.Entry]
Trim down the entries to just have the top (quality) hits (i.e., hit_number == 1).
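The selection and grouping helpers above can be pictured with plain dictionaries. This is a minimal sketch, not the library's implementation: entries are represented as dicts rather than Entry objects, the keys are placeholder strings instead of sha1 hashes, and grouping by retention time is assumed to mean exact equality of the `rt` value.

```python
from collections import defaultdict

# Made-up entries keyed by placeholder strings instead of real sha1 hashes.
total_entries = {
    "h1": {"hit_name": "toluene", "hit_number": 1, "rt": 5.01},
    "h2": {"hit_name": "benzene", "hit_number": 2, "rt": 5.01},
    "h3": {"hit_name": "toluene", "hit_number": 1, "rt": 5.03},
    "h4": {"hit_name": "xylene", "hit_number": 1, "rt": 7.20},
}

# Keep only the top hit (hit_number == 1) for each peak.
top_entries = {k: e for k, e in total_entries.items() if e["hit_number"] == 1}

# Group by hit name, storing (entry, hash) pairs as in the return type above.
by_name = defaultdict(list)
for key, entry in top_entries.items():
    by_name[entry["hit_name"]].append((entry, key))

# Group by retention time (assumed here to be exact equality of the value).
by_rt = defaultdict(list)
for entry in top_entries.values():
    by_rt[entry["rt"]].append(entry)
```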
starlingrt.functions module
Functions to manipulate and compute properties of GCMS data.
Author: Nathan A. Mahynski
- starlingrt.functions.assign_suggestions(df, rt_groups, suggested_name, ties)
Unroll the suggestions and assign them to the dataframe.
- Parameters:
df (pd.DataFrame) – DataFrame of retention times (see get_dataframe).
rt_groups (list(tuple)) – List of tuples of (index, hit_name, quality, rt) by group (see group_by_rt_step).
suggested_name (list) – Suggested name of each group (see suggest_names).
ties (dict) – Dictionary of ties (see suggest_names).
- Returns:
dataframe – DataFrame with “suggested_name” and “flag” columns added.
- Return type:
pd.DataFrame
- starlingrt.functions.closest_rt(rt: float, df_iqr: DataFrame) -> str
Suggest the (visible) species with the closest median retention time.
- Parameters:
rt (float) – Retention time targeted.
df_iqr (pd.DataFrame) – DataFrame summarizing the IQR and whisker bounds for the name_groups.
- Returns:
guess – Name of the Hit with the closest retention time.
- Return type:
str
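The lookup amounts to a nearest-neighbor search over the per-name median retention times. The sketch below uses a plain dict of medians in place of the `df_iqr` DataFrame; `closest_name` and the values are hypothetical and only illustrate the selection rule.

```python
def closest_name(rt: float, medians: dict) -> str:
    """Return the name whose median retention time is nearest to `rt`."""
    return min(medians, key=lambda name: abs(medians[name] - rt))


# Made-up median retention times per hit name.
medians = {"benzene": 3.20, "toluene": 5.01, "xylene": 7.20}
guess = closest_name(5.30, medians)
```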
- starlingrt.functions.estimate_threshold(df: DataFrame, thresholds: ndarray[Any, dtype[_ScalarType_co]] = np.logspace(-4, 1, 100), display: bool = False) -> float
Estimate the minimum gap (threshold) to separate groups from each other.
- Parameters:
df (pd.DataFrame) – DataFrame of retention times for selected target and any selected neighbors (see get_dataframe).
thresholds (ndarray(float, ndim=1), optional(default=np.logspace(-4, 1, 100))) – Sequence of thresholds to try.
display (bool, optional(default=False)) – Whether or not to display the results visually.
- Returns:
threshold – Choice of threshold.
- Return type:
float
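The default grid of thresholds is 100 logarithmically spaced values from 1e-4 to 10. How estimate_threshold picks its final value from that grid is not described on this page, so the sketch below only rebuilds the grid in pure Python and shows how the number of retention-time groups responds to each candidate. `count_groups` and the retention times are hypothetical.

```python
import math

# Pure-Python equivalent of np.logspace(-4, 1, 100): 100 points whose
# exponents are evenly spaced between -4 and 1.
thresholds = [10 ** (-4 + 5 * i / 99) for i in range(100)]


def count_groups(rts, threshold):
    """Count groups formed by splitting sorted values at gaps above threshold."""
    ordered = sorted(rts)
    return 1 + sum(b - a > threshold for a, b in zip(ordered, ordered[1:]))


# Made-up retention times: two tight clusters far apart.
rts = [5.00, 5.01, 5.02, 7.20, 7.21]
counts = [count_groups(rts, t) for t in thresholds]
```

Small thresholds split every point into its own group, large ones merge everything, and a plateau in between corresponds to the two visible clusters.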
- starlingrt.functions.flag_entry_rt(entries: list[tuple['starlingrt.data.Entry', str]], min_entries: int = 10, k: float = 3.0, cv: int = 5, style: str = 'classical') -> ndarray[Any, dtype[bool_]]
Flag entries with anomalous retention times based on the group’s consensus.
If a point is considered an outlier in any single fold, it is flagged.
- Parameters:
entries (list(tuple(Entry, str))) – List of entries (e.g., grouped by name) to examine.
min_entries (int, optional(default=10)) – Minimum length of entries required to do KFold cross-validation (CV); with fewer entries, cv is ignored and LOOCV is used instead.
k (float, optional(default=3.0)) – Number of standard deviations from center allowed. k = 1.5 is more appropriate if using the robust approach based on the IQR as a measure of “spread”; k = 3 is more appropriate for classical, but can be reasonable for the robust approach as well.
cv (int, optional(default=5)) – Number of folds to use in cross-validation. If len(entries) > cv, LOOCV is used instead.
style (str, optional(default="classical")) – Either "classical", which uses the mean and standard deviation, or "robust", which uses the median and IQR, for center and spread, respectively. Inliers are those for which center - k*spread < x < center + k*spread; the rest are flagged.
- Returns:
mask – Mask of outliers corresponding to the ordering in entries.
- Return type:
ndarray(bool)
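The inlier rule center - k*spread < x < center + k*spread can be sketched with the standard library. The version below is a leave-one-out illustration only: it assumes each point is tested against the center and spread of the remaining points, which may differ from how the library builds its folds. `center_and_spread` and `flag_loo` are hypothetical helpers, and the quartiles come from `statistics.quantiles`, whose default method may not match the library's IQR.

```python
import statistics


def center_and_spread(values, style):
    """Return (center, spread) for the 'classical' or 'robust' style."""
    if style == "classical":
        return statistics.mean(values), statistics.stdev(values)
    if style == "robust":
        q1, _, q3 = statistics.quantiles(values, n=4)
        return statistics.median(values), q3 - q1
    raise ValueError(f"unknown style: {style!r}")


def flag_loo(values, k=3.0, style="classical"):
    """Flag each value lying outside center +/- k*spread of the other values."""
    mask = []
    for i, x in enumerate(values):
        rest = values[:i] + values[i + 1:]
        center, spread = center_and_spread(rest, style)
        inlier = center - k * spread < x < center + k * spread
        mask.append(not inlier)
    return mask


# Made-up retention times with one obvious outlier at the end.
rts = [5.00, 5.01, 5.02, 4.99, 5.01, 5.00, 5.02, 4.98, 5.01, 6.50]
classical_mask = flag_loo(rts, k=3.0, style="classical")
robust_mask = flag_loo(rts, k=3.0, style="robust")
```

Holding the tested point out matters for the classical style: if the outlier were included when computing the mean and standard deviation of these ten values, it would inflate the spread enough to fall inside the k = 3 bounds.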
- starlingrt.functions.get_dataframe(entries: dict, target: str | None = None, pm: int = 0) -> tuple['pd.DataFrame', 'pd.api.typing.DataFrameGroupBy', list]
Get dataframe centered on a target.
- Parameters:
entries (dict(str, Entry)) – Dictionary of Entry whose keys are sha1 hashes and values are Entry objects.
target (str, optional(default=None)) – Name of retention time group to target.
pm (int, optional(default=0)) – Number of neighbors around target to select.
- Returns:
results (pd.DataFrame) – DataFrame of retention times for selected target and any selected neighbors.
name_groups (pd.api.typing.DataFrameGroupBy) – Pandas GroupBy object containing the entries grouped by Hit name.
order_cats_used (list) – Ordered list of names of the groups selected.
- starlingrt.functions.get_quantiles_df(name_groups: DataFrameGroupBy) -> DataFrame
Get the 0.25, 0.50, and 0.75 quantiles of the groups.
- Parameters:
name_groups (pd.api.typing.DataFrameGroupBy) – Pandas GroupBy object containing the entries grouped by Hit name.
- Returns:
dataframe – DataFrame summarizing the IQR and whisker bounds for the name_groups.
- Return type:
pd.DataFrame
- starlingrt.functions.group_by_rt_step(df: DataFrame, threshold: float = 0.04) -> list[list[tuple[Any, str, float, float]]]
Create groups based on similar retention times.
This algorithm simply sorts based on retention time, then creates a new group when a gap between consecutive (sorted) points exceeds the threshold.
- Parameters:
df (pd.DataFrame) – DataFrame of retention times (see get_dataframe).
threshold (float, optional(default=0.04)) – Minimum retention time gap between consecutive compounds to be resolved as different.
- Returns:
rt_groups – List of tuples of (index, hit_name, quality, rt) by group.
- Return type:
list(list(tuples))
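The sort-then-split rule described above can be sketched on plain tuples. `group_by_gap` is a hypothetical helper operating on a list of (index, hit_name, quality, rt) tuples instead of a DataFrame, and the rows are made up.

```python
def group_by_gap(rows, threshold=0.04):
    """Sort rows by retention time; start a new group when a gap exceeds threshold."""
    ordered = sorted(rows, key=lambda row: row[3])
    groups = []
    for row in ordered:
        # Compare against the last row of the current group (consecutive points).
        if groups and row[3] - groups[-1][-1][3] <= threshold:
            groups[-1].append(row)
        else:
            groups.append([row])
    return groups


# Made-up rows of (index, hit_name, quality, rt).
rows = [
    (0, "toluene", 95, 5.01),
    (1, "xylene", 90, 7.20),
    (2, "toluene", 91, 5.03),
    (3, "benzene", 40, 5.06),
    (4, "xylene", 88, 7.22),
]
rt_groups = group_by_gap(rows)
```

Because only consecutive gaps are compared, a group can span more than the threshold overall: here the first group runs from 5.01 to 5.06 even though each step is at most 0.04.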
- starlingrt.functions.make_dataframe(entries: dict) -> DataFrame
Create a dataframe out of the entries.
- Parameters:
entries (dict(str, Entry)) – Dictionary of Entry whose keys are sha1 hashes and values are Entry objects.
- Returns:
dataframe – DataFrame of Entries sorted by Hit name.
- Return type:
pd.DataFrame
- starlingrt.functions.make_histograms(by_name: dict[str, list[tuple['starlingrt.data.Entry', str]]], k_values: ndarray[Any, dtype[floating]], bins: int = 10, cv: int = 3, style: str = 'robust', min_entries: int = 5) -> tuple[dict, dict, dict]
Make histograms of retention times for each compound by name.
- Parameters:
by_name (dict(str, list(tuple(Entry, str)))) – Dictionary of Entry whose keys are hit names and values are tuples of (Entry, hash).
k_values (array-like) – List of k values to use to flag “outlying” retention times.
bins (int) – Number of histogram bins to use for retention times.
cv (int, optional(default=3)) – Number of folds to use in cross-validation.
style (str, optional(default="robust")) – Either "classical", which uses the mean and standard deviation, or "robust", which uses the median and IQR, for center and spread, respectively. Inliers are those for which center - k*spread < x < center + k*spread; the rest are flagged.
min_entries (int, optional(default=5)) – Minimum length of entries required to do KFold cross-validation (CV); with fewer entries, cv is ignored and LOOCV is used instead.
- Returns:
histograms (dict(str, list)) – Values of the histogram for each compound.
bin_edges (dict(str, list)) – Bin edges for the histogram of each compound.
points (dict(str, dict(str, dict(str, dict(str, list))))) – Nested dictionary of points which are of concern or not of concern for each k value; e.g., points['methane']['3.0']['concern'] = {'x': retention_times, 'y': staggered_bin_counts}.
- starlingrt.functions.suggest_names(rt_groups: list[list[tuple[int, str, float, float]]]) -> tuple[list, dict, list]
Suggest the best name for each group of compounds with similar retention times.
Computes a “probability” using the quality of each observation in a group to determine the most likely name.
- Parameters:
rt_groups (list(tuple)) – List of tuples of (index, hit_name, quality, rt) by group (see group_by_rt_step).
- Returns:
suggested_name (list) – Suggested name of each group.
ties (dict) – Dictionary of ties.
entropy (list) – Entropy of each group.
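The exact "probability" formula is not given on this page. The sketch below assumes one plausible reading: each name's probability is its share of the summed quality in the group, the suggestion is the most probable name, ties are names sharing the top probability, and the entropy is the Shannon entropy (natural log) of that distribution. `suggest` is a hypothetical helper, qualities are assumed positive, and the alphabetical tie-break is an arbitrary choice for the example.

```python
import math
from collections import defaultdict


def suggest(group):
    """Return (best_name, tied_names, entropy) for one retention-time group."""
    weight = defaultdict(float)
    for _index, hit_name, quality, _rt in group:
        weight[hit_name] += quality
    total = sum(weight.values())
    prob = {name: w / total for name, w in weight.items()}
    best = max(prob.values())
    # Names sharing the top probability; the first alphabetically is suggested.
    tied = sorted(name for name, p in prob.items() if math.isclose(p, best))
    entropy = -sum(p * math.log(p) for p in prob.values() if p > 0)
    return tied[0], tied, entropy


# Made-up groups of (index, hit_name, quality, rt).
mixed = [(0, "toluene", 95, 5.01), (2, "toluene", 91, 5.03), (3, "benzene", 40, 5.06)]
pure = [(1, "xylene", 90, 7.20), (4, "xylene", 88, 7.22)]
split = [(5, "a", 50, 1.00), (6, "b", 50, 1.01)]

mixed_result = suggest(mixed)
pure_result = suggest(pure)
split_result = suggest(split)
```

A group with a single name has zero entropy, while an evenly split group has the maximum entropy for its number of names, so the entropy gives a rough measure of how contested a suggestion is.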
starlingrt.sample module
Structures to store samples from different mass spectrometers.
Author: Nathan A. Mahynski
starlingrt.visualize module
Visualize GCMS data to determine consensus and any corrections necessary.
Author: Nathan A. Mahynski
- starlingrt.visualize.make(top_entries: dict[str, starlingrt.data.Entry], width: int, threshold: float, output_filename: str = 'summary.html') -> None
Make the interactive HTML document for users to inspect.
- Parameters:
top_entries (dict(str, Entry)) – Top entries (hit number = 1) labeled by their sha1 hash.
width (int) – Width of the HTML table output.
threshold (float) – Minimum retention time gap between consecutive compounds to be resolved as different.
output_filename (str, optional(default="summary.html")) – Name of HTML file to save results to.
Module contents
Starlingrt module.
Author: Nathan A. Mahynski