API reference#
modality.ContigDataset#
- class modality.ContigDataset(ds: Dataset | None = None, data_vars: dict | None = None, coords: dict | None = None, attrs: dict | None = None)#
Methylation context-specific xarray-like dataset.
A wrapper around an xarray.Dataset instance that extends its functionality.
- __getitem__(key)#
- Allows:
- access by contig and positions, e.g. ds["chr1:1000-2000"]
- access by contig, e.g. ds["chr1"]
- access by the usual xarray row slicing, e.g. ds[1000:2000]
- Parameters:
key – the key to be accessed.
- Returns:
The slice of the dataset corresponding to the key.
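Example use (illustrative sketch; assumes ds is an already-loaded ContigDataset and that "chr1" is present in the data):
>>> region = ds["chr1:1000-2000"]   # positions 1000-2000 on chr1
>>> contig = ds["chr1"]             # every position on chr1
>>> rows = ds[1000:2000]            # plain xarray-style row slice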
- append(value)#
S.append(value) – append value to the end of the sequence
- assign_fractions(numerators: str | list, denominator: str, min_coverage: int, inplace: bool = False)#
Deprecated: Users can simply call the frac_x properties (e.g., frac_c, frac_mc) without needing to call assign_fractions first. Coverage and denominator can be updated using set_fraction_settings.
Add columns to the ContigDataset containing the fraction of each numerator column over the denominator, named after the pattern "frac_[numerator column]".
The parameters are stored in the ContigDataset attributes as “frac_denominator” and “frac_min_coverage”.
- Parameters:
numerators – the columns for which fractions over the denominator are calculated, e.g. "num_mc", "num_modc"
denominator – the column to be used as the denominator of the fractions, one of ["num_total", "num_total_c"], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the fraction denominator required for a position to be included in the fraction calculation. Positions with coverage below min_coverage are set to np.nan
inplace – whether to modify the object in place
- Returns:
A copy of self with the added fraction columns, or None if self is modified in place.
- at_mincoverage(minimum=20)#
Select a subset of positions from the dataset that meet a minimum coverage threshold.
- Parameters:
minimum (int or float) – the minimum coverage threshold.
- Returns:
the ContigDataset instance with the selected positions.
- Return type:
self
Note: The num_total field is used to determine coverage. Use subset_bycoverage for greater flexibility.
- clear() None -- remove all items from S #
- compute_correlation_matrix(numerator: str, denominator: str, min_coverage: int)#
Compute a correlation matrix across samples.
- Parameters:
numerator – the numerator variable
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
- Returns:
The correlation matrix.
- compute_pearson_matrix(numerator: str, denominator: str, min_coverage: int)#
Compute a Pearson correlation matrix. This function calculates the Pearson correlation coefficient between the methylation fractions across samples.
- Parameters:
numerator – the numerator variable
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
- Returns:
The matrix of Pearson correlation coefficients.
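Example use (illustrative sketch; assumes the dataset carries the documented num_mc and num_total_c variables, and that a 10x coverage cutoff is appropriate):
>>> pearson = ds.compute_pearson_matrix(numerator="num_mc", denominator="num_total_c", min_coverage=10)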
- property context: str#
Returns the context of the dataset.
- count(value) integer -- return number of occurrences of value #
- extend(values)#
S.extend(iterable) – extend sequence by appending elements from the iterable
- property frac_c: DataArray#
Fraction of Cs to total Cs.
- property frac_hmc#
Fraction of hmCs to total Cs.
- property frac_mc#
Fraction of mCs to total Cs.
- property frac_modc#
Fraction of modified Cs to total Cs.
- classmethod from_zarrz(fpath: str, **kwargs) ContigDataset #
Load the zipped zarr store or zarr store directory in fpath as a ContigDataset.
See: https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html
- Parameters:
fpath – the path to the zarr store
kwargs – the keyword arguments to be passed to xr.open_zarr
- Returns:
the loaded ContigDataset
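Example use (illustrative sketch; the import path follows the class heading above and the file path is a placeholder):
>>> from modality import ContigDataset
>>> ds = ContigDataset.from_zarrz("results/my_run.zarrz")   # placeholder path to a (zipped) zarr store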
- granges2rows(ranges)#
Return equivalent genomic ranges encoded as DataArray row indices.
- Parameters:
ranges (PyRanges) – the ranges, as a PyRanges object.
- Returns:
the equivalent genomic ranges encoded as DataArray row indices.
- Return type:
result (np.array)
- groupby(group) ContigDatasetGroupBy #
Groups the dataset by the specified group(s), ensuring that all group keys are valid and along the same dimension, and names the resulting dimension based on the group keys.
- Parameters:
ds – The dataset to apply the grouping to.
group (str or list[str]) – The coordinate(s) or variable(s) to group by.
- Returns:
The grouped dataset ready for aggregation.
- Return type:
ContigDatasetGroupBy
- Raises:
ValueError – If any of the group keys are invalid or if the group keys are along different dimensions (axes).
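Example use (illustrative sketch; "condition" is a hypothetical per-sample coordinate, e.g. one added via merge_meta, and aggregation of the returned ContigDatasetGroupBy is assumed to follow the usual xarray groupby pattern):
>>> grouped = ds.groupby("condition")   # group samples by the hypothetical "condition" coordinate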
- property grouped_on: str#
Returns the list of group keys used for grouping the dataset.
- hstack(rest: ContigDataset | list[ContigDataset], combine_attrs: Callable | None = None, **kwargs) ContigDataset #
Stack ContigDatasets by sample_id.
See also: https://docs.xarray.dev/en/stable/generated/xarray.concat.html
- Parameters:
rest – the ContigDataset(s) to be stacked with self
combine_attrs – the function to be used to combine the attributes
**kwargs – the keyword arguments to be passed to xr.concat
- Returns:
the generated stacked ContigDataset
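Example use (illustrative sketch; assumes ds_a, ds_b and ds_c are ContigDatasets over the same positions but different samples):
>>> combined = ds_a.hstack(ds_b)           # stack a single dataset
>>> combined = ds_a.hstack([ds_b, ds_c])   # or a list of datasets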
- index(value[, start[, stop]]) integer -- return first index of value. #
Raises ValueError if the value is not present.
Supporting start and stop arguments is optional, but recommended.
- insert()#
S.insert(index, value) – insert value before index
- merge_annotation(annotation, pos='ref_position')#
Merge the position-based annotation data with the ContigDataset.
- Parameters:
annotation (str) – the path to the annotation data.
pos (str) – the position column to be used for merging.
- Returns:
the ContigDataset instance with the annotation data merged.
- Return type:
self (contig_dataset.ContigDataset)
- merge_meta(metadata: DataFrame | str | Path, **kwargs)#
Add additional coordinates to the dataset based on per-sample metadata. A sample_id column is required.
- Parameters:
metadata – Pandas DataFrame containing the metadata or path to the metadata file.
coord_columns (list) – the columns to be merged.
kwargs – additional keyword arguments to be passed to the read_csv function.
- Returns:
the ContigDataset instance with the additional coordinates.
- Return type:
self (contig_dataset.ContigDataset)
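Example use (illustrative sketch; the sample IDs and the "condition" column are placeholders):
>>> import pandas as pd
>>> meta = pd.DataFrame({"sample_id": ["S1", "S2"], "condition": ["control", "treated"]})
>>> ds = ds.merge_meta(meta)   # "condition" becomes a per-sample coordinate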
- property num_contexts: int#
Returns the number of contexts in the dataset.
- plot_correlation_heatmaps(numerator: str, denominator: str, min_coverage: int, title: str | None = None, **kwargs)#
Plot a correlation across samples, as a heatmap of the methylation fractions, for each pair of samples.
- Parameters:
var – The variable to be plotted.
min_coverage – The minimum coverage.
**kwargs – The keyword arguments passed to plt.subplots()
- Returns:
fig – The figure object.
axes – The axes object.
- plot_correlation_scatter(numerator: str, denominator: str, min_coverage: int, **kwargs)#
Plot the correlation across samples, as a scatter plot of the methylation fractions for each pair of samples.
- Parameters:
numerator – the numerator variable
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
**kwargs – The keyword arguments passed to plt.subplots().
- Returns:
The figure and axes objects.
- plot_coverage_distribution(var='num_total_c', bins=None, norm_hist=False, num_max_tick_labels=10, return_data=False, **kwargs)#
Plot the distribution of coverage across samples.
- Parameters:
var – the variable to plot. Most likely one of [“num_total_c”].
bins – the number of bins or an array of bins. IMPORTANT: the auto-binning may not return an array of bin edges with the desired number of bins. If you want full control over number of bins, you should pass an array of bin edges.
norm_hist – whether to normalise the histogram
num_max_tick_labels – the maximum number of tick labels on the x-axis.
return_data – whether to return the data used to plot the histogram
**kwargs – the keyword arguments passed to sns.lineplot()
- plot_histogram(var='frac_mc', bins=26, norm_hist=False, num_max_tick_labels=7, return_data=False, **kwargs)#
Plot a histogram from the data in the ContigDataset.
- Parameters:
var – the variable to plot
bins – the number of bins or an array of bins. IMPORTANT: the auto-binning may not return an array of bin edges with the desired number of bins. If you want full control over number of bins, you should pass an array of bin edges.
norm_hist – whether to normalise the histogram
num_max_tick_labels – the maximum number of tick labels on the x-axis.
return_data – whether to return the data used to plot the histogram
**kwargs – the keyword arguments passed to sns.catplot()
- Returns:
the catplot object
- Return type:
g
- plot_methylation_tile(numerator: str, denominator: str, min_coverage: int, sample_label: str = 'sample_id', position_covariate=None, title: str | None = 'auto', **kwargs)#
Plot a methylation tile plot. This plot is useful for displaying structure present in the cohort. It is intended for small regions of the genome, as it will load all the data into memory. Such a small region can be accessed by subsetting the dataset, e.g. ds[f"{chr}:{start}-{end}"], or by calling subset_byranges with a single range.
- Parameters:
numerator – the numerator variable
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
sample_label – the label to be used for the sample if not ‘sample_id’
position_covariate – the position covariate to be used
title – the title of the plot
**kwargs – the keyword arguments passed to plt.imshow()
- plot_methylation_trace(numerators: list[str], denominator: str, min_coverage: int, group: str = 'sample_id', rolling_window_size: int = 100, title: str | None = None, facet_kws: dict | None = None, **kwargs)#
Plot a methylation fraction trace across a genomic region. This function is intended for use on a small region of the genome, as it will load all the data into memory. Such a small region can be accessed by subsetting the dataset, e.g. ds[f"{chr}:{start}-{end}"], or by calling subset_byranges with a single range.
- Parameters:
numerators – List of numerator variables; can be one or several. A single comma-separated string is also accepted.
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
group – the group to be plotted
rolling_window_size – the size of the rolling window for smoothing
title – the title of the plot
facet_kws – the keyword arguments passed to sns.relplot() as facet_kws to the FacetGrid, e.g. sharey. If None, defaults to {'sharey': False}.
**kwargs – The keyword arguments passed to sns.relplot()
- Returns:
The figure object.
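Example use (illustrative sketch on a subset; assumes the dataset carries num_mc and num_total_c, and the region coordinates are placeholders):
>>> region = ds["chr1:1000000-1100000"]   # subset first to keep the data small
>>> fig = region.plot_methylation_trace(numerators=["num_mc"], denominator="num_total_c", min_coverage=5)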
- plot_pearson_matrix(numerator: str, denominator: str, min_coverage: int, **kwargs)#
Plot a Pearson correlation matrix. This function calculates the Pearson correlation coefficient between the methylation fractions across samples, using the compute_pearson_matrix function, and plots the matrix as a heatmap.
- Parameters:
numerator – the numerator variable
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
**kwargs – the keyword arguments passed to plt.subplots(). For instance, vmin and vmax can be used to set the minimum and maximum values of the colorbar. labelsize will set the size of the labels x and y axis.
- Returns:
The figure and axes objects.
- plot_six_base_ternary(denominator: str = 'num_total', min_coverage: int = 10, cmap=None, gridsize: int = 50, figsize=None, include_colorbar: bool = True, **kwargs)#
Creates a ternary plot of 6-base data, showing the distribution of C, 5mC, and 5hmC fractions across samples.
- Parameters:
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
cmap – the colormap to use
gridsize – the number of hexagons in the x-direction
figsize – the size of the figure
include_colorbar – whether to include a colorbar, showing the logged counts of contexts in a given bin.
**kwargs – the keyword arguments passed to plt.figure()
- Returns:
The figure and axes objects.
- pop([index]) item -- remove and return item at index (default last). #
Raise IndexError if list is empty or index is out of range.
- reduce_bybins(column: Array | ndarray, bins: Array | ndarray, var: str | list[str], counts_only: bool = False, sum_only: bool = False, chunks: int = 100000) Dataset #
Reduce positions by binning column by provided bins. This is useful for looking at the distribution of data variables across a continuous variable such as distance to nearest TSS or CpG richness.
- Parameters:
column – an array of length equal to the ContigDataset position dimension with the values by which to reduce the samples
bins – an array of bin values to use for grouping the values in column. Note: bins must be explicitly provided
var – data variable(s) to reduce
counts_only – return only the count of data entries
sum_only – return only the sum of data entries
chunks – column and bins chunks size
- Returns:
xarray Dataset with the ContigDataset data reduced by column. Includes bin_start, bin_end and bin_midpoint.
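Example use (illustrative sketch; a random per-position covariate is used as a stand-in for something like distance to nearest TSS or CpG density, ds.sizes["pos"] is assumed to give the length of the position dimension as for an xarray Dataset, and num_mc is assumed to be present):
>>> import numpy as np
>>> covariate = np.random.rand(ds.sizes["pos"])   # hypothetical per-position covariate in [0, 1)
>>> bins = np.linspace(0, 1, 11)                  # explicit bin edges are required
>>> binned = ds.reduce_bybins(column=covariate, bins=bins, var="num_mc")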
- reduce_byranges(ranges: PyRanges | list[PyRanges], var: str | list[str], counts_only: bool = False, sum_only: bool = False, min_count: int = 1, remove_non_overlapping_ranges: bool = False, store_result_to_disk=True)#
Reduce data variables by the provided genomic ranges in order to gain a summarized view of the data over ranges. Returns a new xarray.Dataset with the data summarized over the provided ranges. This function computes results immediately, and is NOT a deferred operation. Important: Ranges are assumed to be 0 based. If ranges are 1 based, set ranges_are_1based=True.
- Parameters:
ranges (pr.PyRanges or list[pr.PyRanges], required) – PyRanges object(s) to reduce by. If a list of PyRanges objects is provided, they will be concatenated before reducing.
var (str or list[str], required) – Data variable(s) to reduce. If a list of data variables is provided, they will be reduced separately.
counts_only (bool, optional) – If True, only the count of data entries will be returned. Default is False.
sum_only (bool, optional) – If True, only the sum of data entries will be returned. Default is False.
min_count (int, optional) – Minimum number of data entries required in each interval of range to calculate mean. Default is 1.
ranges_are_1based (bool, optional) – If True, ranges are assumed to be 1-based. Default is False (i.e. 0 based).
merge_range_intervals (bool, optional) – If True, intervals within the ranges will be merged if they overlap. Default is False.
remove_non_overlapping_ranges (bool, optional) – If True, ranges that do not overlap with the dataset will be removed. Default is False.
store_result_to_disk (bool, optional) – If True, the result will be stored to disk and available via a dask deferred operation. Default is True. Recommended to leave as True, unless the number of regions is small (<500_000).
- Returns:
Dataset with the data summarized over the provided ranges. The dataset will contain the following variables:
- For each data variable in var:
  - <data_variable>_sum: the sum of data entries for the given range.
  - <data_variable>_mean: the mean of data entries for the given range.
  - <data_variable>_cpg_count: the non-NaN count of contexts for the given range.
- Additional variables from the ranges dataframe.
- Return type:
xr.Dataset
Note
<data_variable>_cpg_count returns the count of non-NaN data entries for the given range. If you pass a list of two variables to var, for instance num_modc and frac_modc, the corresponding <data_variable>_cpg_count can differ between the two. This is because frac_modc is generated by ds.assign_fractions(…, min_coverage=X) and contains NaN values where the coverage is below X. In this case, the <data_variable>_cpg_count for num_modc will be the same as that for frac_modc where the coverage is above X, and different otherwise.
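Example use (illustrative sketch; the intervals are hand-made placeholders in 0-based coordinates, and num_mc/num_total_c are assumed to be present):
>>> import pandas as pd
>>> import pyranges as pr
>>> regions = pr.PyRanges(pd.DataFrame({"Chromosome": ["chr1", "chr1"], "Start": [0, 10_000], "End": [1_000, 11_000]}))
>>> summary = ds.reduce_byranges(ranges=regions, var=["num_mc", "num_total_c"])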
- property ref_genome: str#
Returns the reference genome used for the dataset.
- remove(value)#
S.remove(value) – remove first occurrence of value. Raise ValueError if the value is not present.
- rename_samples(**kwargs)#
Renames sample IDs using one of several available methods.
- Parameters:
None
- Keyword Arguments:
mapping (dict) – A dictionary mapping old IDs to new IDs.
func (callable) – A function to transform each sample ID.
old_substr (str) – A substring to be replaced in each sample ID.
new_substr (str) – The replacement substring. Must be specified, i.e. does not default to "".
count (int, optional) – Number of occurrences to replace. Defaults to all.
remove_common_prefix (bool, optional) – If True, removes the common prefix from all sample IDs.
reset (bool, optional) – If True, resets IDs to "SAMPLE_1", "SAMPLE_2", etc.
prefix (str, optional) – Prefix to use when resetting IDs. Defaults to "SAMPLE".
- Returns:
The modified ContigDataset with updated sample ids.
- Return type:
self (contig_dataset.ContigDataset)
Note:
- Order of precedence is mapping, func, (old_substr, new_substr), remove_common_prefix, reset.
- Previous versions of modality allowed a list of sample IDs to be passed to this method. This has been removed to avoid potential ordering bugs.
- The ContigDataset is modified in place, but self is returned to allow chaining.
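Example use (illustrative sketch; the old/new IDs and the substring are placeholders):
>>> ds = ds.rename_samples(mapping={"old_id": "new_id"})       # explicit old-to-new mapping
>>> ds = ds.rename_samples(func=str.upper)                     # apply a function to every ID
>>> ds = ds.rename_samples(old_substr="-rep1", new_substr="")  # substring replacement
>>> ds = ds.rename_samples(reset=True, prefix="SAMPLE")        # reset to SAMPLE_1, SAMPLE_2, ...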
- reverse()#
S.reverse() – reverse IN PLACE
- property sample_ids: list[str]#
Returns the list of sample IDs in the dataset. If the dataset has been grouped, returns the original sample IDs by flattening the provenance-tracked sample_id coordinate.
- property sample_ix: DataArray#
Returns the dimension along which samples are indexed. This will normally be “sample_id” but can be different if the dataset has been manipulated, e.g. by a groupby operation.
- select_samples(sample_list)#
Select a subset of samples from the dataset.
- Parameters:
sample_list (str or list) – the sample or list of samples to be selected.
- Returns:
the ContigDataset instance with the selected samples.
- Return type:
self (contig_dataset.ContigDataset)
- property slices: dict#
Returns the generated slices dictionary.
- subset_bycontigs(contigs: list[str], write_to_disk: bool = False, output_path=None)#
Subset the dataset by contigs, returning a new dataset with only the positions that fall within the provided contigs.
- Example use:
>>> ds.subset_bycontigs(["chr1", "chr2", "chrX"])
- Parameters:
contigs (list) – The list of contigs to subset to
write_to_disk (bool, optional) – whether to write the subset to disk.
output_path (str, optional) – the path to write the subset to (required if write_to_disk is True).
- Returns:
the subset ContigDataset.
- Return type:
ds (ContigDataset)
Note
If write_to_disk is True, the function will write the subset to disk and return a new ContigDataset read from the written file. If write_to_disk is False, the function will return the subset ContigDataset without writing to disk.
- subset_bycoverage(min_coverage: int = 0, max_coverage=None, coverage_array='num_total_c', method='all', write_to_disk: bool = False, output_path=None)#
Subset the ContigDataset by coverage, returning a new dataset with only the positions that fall within the provided coverage range. Positions are removed where the coverage is strictly lower than min_coverage or, if max_coverage is provided, strictly higher than max_coverage. The removal strategy is determined by the method parameter.
- Parameters:
min_coverage (int, optional) – the minimum coverage required for a position to be included in the subset.
max_coverage (int, optional) – the maximum coverage required for a position to be included in the subset.
coverage_array (str, optional) – the data variable to use for coverage.
method (str, optional) – the method to use for determining coverage. One of ["all", "any", "mean"]:
- "all" removes a position only if all samples have a coverage strictly lower than min_coverage or, if max_coverage is provided, strictly higher than max_coverage.
- "any" removes a position if any sample has a coverage strictly lower than min_coverage or, if max_coverage is provided, strictly higher than max_coverage.
- "mean" removes a position if the mean coverage across all samples is strictly lower than min_coverage or, if max_coverage is provided, strictly higher than max_coverage.
write_to_disk (bool, optional) – whether to write the subset to disk.
output_path (str, optional) – the path to write the subset to (required if write_to_disk is True).
- Returns:
the subset ContigDataset.
- Return type:
ds (ContigDataset)
Note
If write_to_disk is True, the function will write the subset to disk and return a new ContigDataset read from the written file. If write_to_disk is False, the function will return the subset ContigDataset without writing to disk.
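Example use (illustrative sketch; the coverage cutoffs are placeholders, and method="any" keeps only positions within the coverage range in every sample, per the removal rules above):
>>> filtered = ds.subset_bycoverage(min_coverage=10, max_coverage=100, coverage_array="num_total_c", method="any")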
- subset_byranges(ranges: PyRanges | list[PyRanges], ranges_are_1based: bool = False, merge_range_intervals: bool = False, write_to_disk: bool = False, output_path=None)#
Subset the dataset by ranges, returning a new dataset with only the positions that fall within the provided ranges.
Important
Ranges are assumed to be 0 based. If ranges are 1 based, set ranges_are_1based=True.
Ranges are half-open intervals, i.e. [start, end), where start is inclusive and end is exclusive.
- Parameters:
ranges (pr.PyRanges or list[pr.PyRanges]) – PyRanges object(s) to subset by. If a list of PyRanges objects is provided, they will be concatenated before reducing.
ranges_are_1based (bool, optional) – If True, ranges are assumed to be 1-based. Default is False (i.e. 0 based).
merge_range_intervals (bool, optional) – If True, intervals within the ranges will be merged if they overlap. Default is False.
write_to_disk (bool, optional) – Whether to write the subset to disk.
output_path (str, optional) – the path to write the subset to (required if write_to_disk is True).
- Returns:
the subset ContigDataset.
- Return type:
subset_data (ContigDataset)
Note
If write_to_disk is True, the function will write the subset to disk and return a new ContigDataset read from the written file. If write_to_disk is False, the function will return the subset ContigDataset without writing to disk.
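Example use (illustrative sketch; the single interval is a placeholder in 0-based, half-open coordinates):
>>> import pandas as pd
>>> import pyranges as pr
>>> roi = pr.PyRanges(pd.DataFrame({"Chromosome": ["chr2"], "Start": [50_000], "End": [60_000]}))
>>> sub = ds.subset_byranges(roi)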
- summary_cpg_coverage()#
Return a dataframe with the number of CpGs at each coverage depth.
- to_bedgraph(sample_ids=None, modification_count_column: str = 'num_modc', denominator_column: str = 'num_total_c', tag: str = '', file_template: str = '{sample_id}.{tag}{var}', additional_columns=None, output_dir: str = '.', skip_rows_zero_counts: bool = True, tabix_index: bool = False)#
Write data to a bedGraph format: https://genome.ucsc.edu/goldenPath/help/bedgraph.html
A bedgraph contains 4 required columns: chromosome, start, end, value. The value is the quantification value. We can append additional columns to the bedgraph, to provide further information.
The bedGraph format contains a header line that begins with the word “track” followed by a space-separated list of key-value pairs. The key-value pairs are of the form key=value. The track line must end with a newline character.
>>> from modality.mock_contig_dataset import make_mock_5hmc_cds
>>> cds = make_mock_5hmc_cds({"chr1": 1_000_000, "chr2": 1_000_000})
>>> cds = cds.assign_fractions(['num_hmc', 'num_mc'], denominator='num_total_c', min_coverage=0)
>>> bedgraphs = cds.to_bedgraph(output_dir="/tmp")
- Parameters:
sample_ids – sample IDs to export bedGraphs for. Note: A separate file will be created for each sample. If None, all samples are exported.
modification_count_column – the data variable to export. If you are working with evoC data, this will typically be num_hmc for 5hmC or num_mc for 5mC.
denominator_column – the column to use to calculate the methylation fraction, and to use as the coverage. Default: “num_total_c”, if you wish to include all bases use “num_total”.
additional_columns – Extra data variable(s) to export. Both strings and list arguments are supported. Note 1: providing multiple column names will result in the bedGraph containing >4 columns and some downstream tools may not support this.
file_template – the template for the file name. The sample_id and value_column will be inserted into the template. Default: “{sample_id}.{var}.bdg.gz”
output_dir – path to output directory
skip_rows_nan_values – whether to skip rows with missing/nan values in the value_column. Default: True
tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False
- Returns:
A list of created file paths.
Notes
Genomic coordinates are 0-based and half-open. An uncompressed output is not supported due to write speed and file size considerations. If you require an uncompressed output, please use gunzip post-processing.
- to_bedmethyl(sample_ids=None, modification_count_column: str = 'num_modc', denominator_column: str = 'num_total_c', tag: str = '', file_template: str = '{sample_id}.{tag}{var}', output_dir: str = '.', skip_rows_zero_counts: bool = True, tabix_index: bool = False)#
Write data to a bedMethyl format. An 11-column tab-separated file format that contains the following columns:
Reference chromosome or scaffold
Start position in chromosome
End position in chromosome
Name of item
Score from 0-1000. Capped number of reads
Strandedness, plus (+), minus (-), or unknown (.)
Start of where display should be thick (start codon)
End of where display should be thick (stop codon)
Color value (RGB)
Coverage, or number of reads
Percentage of reads that show methylation at this position in the genome
https://www.encodeproject.org/data-standards/wgbs/
- Parameters:
sample_ids – sample IDs to export bedMethyl files for. Note: A separate file will be created for each sample. If None, all samples are exported.
modification_count_column – the data variable to export. If you are working with evoC data, this will typically be num_hmc for 5hmC or num_mc for 5mC.
denominator_column – the column to use to calculate the methylation fraction, and to use as the coverage. Default: “num_total_c”, if you wish to include all bases use “num_total”.
file_template – the template for the file name. The sample_id and count_methylated_column will be inserted into the template. Default: “{sample_id}.{var}_cxreport.txt.gz”
output_dir – path to output directory
skip_rows_zero_counts – whether to skip rows with zero counts in the denominator column. Default: True
tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False
- Returns:
A list of created file paths.
Notes
An uncompressed output is not supported due to write speed and file size considerations. If you require an uncompressed output, please use gunzip post-processing.
- to_bismark_cov_report(sample_ids=None, modification_count_column: str = 'num_modc', denominator_column: str = 'num_total_c', tag: str = '', file_template: str = '{sample_id}.{tag}{var}', output_dir: str = '.', skip_rows_zero_counts: bool = True, tabix_index: bool = False)#
Write data to a Bismark coverage report format, see: https://support.illumina.com/help/BaseSpace_App_MethylSeq_help/Content/Vault/Informatics/Sequencing_Analysis/Apps/swSEQ_mAPP_MethylSeq_Bismark_Coverage.htm
This is a 6-column tab-separated file format that contains the following columns: <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
- Parameters:
sample_ids – sample IDs to export files for. Note: A separate file will be created for each sample. If None, all samples are exported.
modification_count_column – the data variable to export. If you are working with evoC data, this will typically be num_hmc for 5hmC or num_mc for 5mC.
denominator_column – the column to use to calculate the methylation fraction, and to use as the coverage. Default: “num_total_c”, if you wish to include all bases use “num_total”.
tag – a tag to add to the file name. Default: “”
file_template – the template for the file name. The sample_id and count_methylated_column will be inserted into the template. Default: “{sample_id}.{var}_bed.txt.gz”
output_dir – path to output directory
skip_rows_zero_counts – whether to skip rows with zero counts in both methylation columns. Default: True
tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False
- Returns:
A list of created file paths.
Notes
Methylation percentage is calculated as 100 * count_methylated / denominator_column. The <count unmethylated> column, as described in the Bismark format, is calculated as denominator - count_methylated. In a 3-state system (m, h, u) this column will therefore contain the sum of unmodified bases and bases carrying a modification other than the one specified in modification_count_column.
An uncompressed output is not supported due to write speed and file size considerations. If you require an uncompressed output, please use gunzip post-processing.
- to_cxreport(sample_ids=None, modification_count_column: str = 'num_modc', denominator_column: str = 'num_total_c', tag: str = '', file_template: str = '{sample_id}.{tag}{var}', output_dir: str = '.', skip_rows_zero_counts: bool = True, tabix_index: bool = False)#
Write data to a cxreport format:
The CX report format is a tab-separated file format that contains the following columns: <chromosome> <position> <strand> <count methylated> <count unmethylated> <C-context> <trinucleotide context>. The position is 1-based, as opposed to modality, which is 0-based. See: https://support-docs.illumina.com/SW/DRAGEN_v39/Content/SW/DRAGEN/MPipelineBias_fDG.htm
- Parameters:
sample_ids – sample IDs to export cxreports for. Note: A separate file will be created for each sample. If None, all samples are exported.
modification_count_column – the data variable to export. If you are working with evoC data, this will typically be num_hmc for 5hmC or num_mc for 5mC.
denominator_column – the column to use to calculate the methylation fraction, and to use as the coverage. Default: “num_total_c”, if you wish to include all bases use “num_total”.
file_template – the template for the file name. The sample_id and count_methylated_column will be inserted into the template. Default: “{sample_id}.{var}_cxreport.txt.gz”
output_dir – path to output directory
skip_rows_zero_counts – whether to skip rows with zero counts in both methylation columns. Default: True
tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False
- Returns:
A list of created file paths.
Notes
This function is only applicable to Zarr stores generated with duet pipeline version >=1.4.0, as these contain the required encoded column, "trinucleotide_context", for the cxreport format. The trinucleotide context is compressed through encoding and requires decoding to extract the context sequences. The two decoded columns dinucleotide_text and trinucleotide_text will automatically be decoded when exporting a cxreport, but they can also be generated manually by running:
>>> cds = cds.assign_text_nucleotide_contexts("trinucleotide_context")
If needed, you can also create empty dummy "." columns using the following code:
>>> cds = cds.assign_text_nucleotide_contexts("dummy")
The <count unmethylated> column, as described in the Bismark format, is calculated as denominator - count_methylated. In a 3-state system (m, h, u) this column will therefore contain the sum of unmodified bases and bases carrying a modification other than the one specified in modification_count_column.
An uncompressed output is not supported due to write speed and file size considerations. If you require an uncompressed output, please use gunzip post-processing.
- to_quant(sample_ids=None, tag: str = '', file_template: str = '{sample_id}.{tag}{var}', mode='infer', output_dir: str = '.', skip_rows_zero_counts: bool = True, modc_degenerate_in_evoC: bool = False, tabix_index: bool = False)#
Write data to a biomodal quant report format. A 6 or 8 column tab-separated file format that contains the following columns:
# contig ref_position num_modC num_mC num_hmC num_C num_total strand  (for evoC)
# contig ref_position num_modC num_C num_total strand  (for +modC)
- Parameters:
sample_ids – sample IDs to export quant reports for. Note: A separate file will be created for each sample. If None, all samples are exported.
file_template – the template for the file name. The sample_id and count_methylated_column will be inserted into the template. Default: “{sample_id}.{var}.quant.txt.gz”
mode – The export mode for the quantification data. Options are:
- "infer" (default): infer the mode based on the presence of the num_hmc column. The num_hmc column is present in 6-base (i.e. duet evoC) data, but not in duet +modC data.
- "evoC": for duet evoC (i.e. 6-base) data
- "modC": for duet +modC data
output_dir – path to output directory
skip_rows_zero_counts – whether to skip rows with zero counts in the num_total column. Default: True
modc_degenerate_in_evoC – in duet evoC (i.e. 6 base), whether the modC column represents degenerate modifications (i.e., modified but uncertain between mC or hmC). If False (default), modC is the superset of mC, hmC, and degenerate modifications (mC or hmC), consistent with duet +modC behavior. If True, modC represents only the degenerate (mC or hmC) state.
tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False
- Returns:
A list of created file paths.
Notes
Genomic coordinates are 1-based and closed. An uncompressed output is not supported due to write speed and file size considerations. If you require an uncompressed output, please use gunzip post-processing.
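Example use (illustrative sketch; the output directory is a placeholder and the mode is inferred from the data):
>>> paths = ds.to_quant(output_dir="quant_exports", mode="infer")   # one file per sample; returns the created paths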
- to_zarrz(fpath: str, zipped: bool = True, overwrite: bool = False)#
Write the ContigDataset to a zarr store directory or zipped zarr store file.
- Parameters:
fpath – the path to the zarr store
zipped – whether to use a zipped zarr store. Default is True.
overwrite – whether to overwrite an existing zarr store. Default is False.
- Returns:
the path to the zarr store
- Return type:
Path
- validate_methylation_contexts(fields: list[str]) None #
Validates that the provided fields are valid methylation contexts.
- Parameters:
fields – List of fields to validate.
- Raises:
ValueError – If any field is not a valid methylation context.
- vstack(rest, dim='pos')#
Stack ContigDatasets by contig.
- Parameters:
rest (ContigDataset or list) – the ContigDatasets to be stacked.
dim (str) – the dimension to be stacked.
- Returns:
the stacked ContigDataset.
- Return type:
ContigDataset
- where(cond, other=-1, drop=True)#
This method filters the dataset based on a given condition, optionally replacing values where the condition is False and dropping dimensions where the condition is not met.
- Parameters:
cond (xarray.DataArray or similar) – A boolean condition to be applied. Must be 1D and aligned with a named dimension in the dataset.
other (optional) – The value to use where the condition is False. Defaults to -1.
drop (bool, optional) – Whether to drop dimensions where the condition is False. Defaults to True.
- Returns:
ContigDataset – A new ContigDataset instance with the condition applied.
- Raises:
ValueError – If the condition is not 1D or is not aligned with a named dimension in the dataset.
Notes
The default value for other is -1, differing from the default np.nan in xarray’s where method.
The condition must be aligned with a named dimension in the dataset, such as ‘pos’ or ‘sample_id’.
See also
xarray.Dataset.where: The underlying xarray method used for applying the condition.
modality.concat_contig_datasets#
- modality.concat_contig_datasets(datasets: list[ContigDataset], write_to_disk: bool = True, output_path: str | None = None) ContigDataset #
Orchestrates the concatenation of multiple ContigDatasets.
- Parameters:
datasets – List of ContigDatasets or paths to input Zarr files.
output_path – Path to the output combined Zarr file.
write_to_disk – Whether to write the result to disk.
- Returns:
The concatenated dataset.
- Return type:
ContigDataset
- Raises:
ValueError – If the output path is invalid or if the datasets are incompatible.
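Example use (illustrative sketch; assumes ds_a and ds_b are compatible ContigDatasets, that an in-memory result is acceptable, and that the import path follows the heading above):
>>> from modality import concat_contig_datasets
>>> combined = concat_contig_datasets([ds_a, ds_b], write_to_disk=False)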
modality.load_biomodal_dataset#
- modality.load_biomodal_dataset(name: str = 'esE14', pipeline_version: str = 'v1.4.1', chr2_only: bool = False, merge_technical_replicates=False, merge_cpgs=False) ContigDataset #
- modality.load_dataset_description(name=None) str #
modality.analysis module#
- class modality.analysis.DMRBatchWorkflow(ds: ContigDataset | Dataset, output_dir: str)#
Batch processing workflow for DMR calling.
DMRBatchWorkflow extends DMRWorkflow to support large-scale, batch-based differential methylation analysis. Unlike DMRWorkflow, which processes a single condition comparison at a time, DMRBatchWorkflow allows for multiple condition comparisons and integrates sample metadata.
The workflow provides additional functionality, such as:
- Batch processing of condition pairs
- Sample sheet handling for metadata and experimental design
- Configurable methylation contexts
- Automatic filtering of low-coverage contexts
Notes
This workflow enforces a default window size of 1000 bp for segmentation, unlike DMRWorkflow, which relies on adaptive segmentation thresholds.
The run method processes all condition pairs defined in the dataset or sample sheet and saves the results to the output directory as a BED file.
- Parameters:
ds (ContigDataset | xr.Dataset) – The input dataset containing methylation and condition data.
output_dir (str) – The directory where results will be saved.
- include_headers(include_headers: bool)#
Whether or not to include headers in the saved DMR file.
- run(group_name: str, _count_array_name: str = 'num_modc', _total_counts_name: str = 'num_total_c') DMRResult #
Execute batch DMR calling on the dataset.
This method runs differential methylation analysis across multiple condition pairs, applying filtering and metadata alignment as needed.
- Parameters:
group_name (str) – The categorical variable (e.g., “sex”, “treatment_group”) used for condition comparison.
- Returns:
The computed differential methylation regions (DMRs).
- Return type:
DMRResult
- Raises:
ModalityError – If an error occurs during processing.
Notes
If a sample sheet is provided, it will be aligned and validated against the ContigDataset before processing.
Methylation contexts default to [“num_modc”] if not explicitly set.
Conditions are automatically ordered, parsed and paired.
Low-coverage contexts can be filtered out using set_filter_context_depth().
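Example use (illustrative sketch; assumes "treatment_group" is a per-sample variable in the dataset, and the output directory is a placeholder):
>>> from modality.analysis import DMRBatchWorkflow
>>> wf = DMRBatchWorkflow(ds, output_dir="dmr_batch_results")
>>> result = wf.run(group_name="treatment_group")   # DMR files are also written to output_dir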
- set_compress_dmr_file(compress: bool)#
Whether or not to compress the DMR file.
- set_include_headers_in_dmr_file(include_headers: bool)#
Whether or not to include column headers in the saved DMR file.
- class modality.analysis.DMRResult(df: DataFrame)#
DMRResult class for handling Differentially Methylated Regions (DMR) results.
Attributes
df : pd.DataFrame
DataFrame containing the DMR results.
Methods
__init__(df: pd.DataFrame)
Initialize the DMRResult with a DataFrame.
- __getattr__(attr)
Delegate attribute access to the underlying DataFrame.
- __setitem__(key, value)
Set item in the underlying DataFrame.
- __getitem__(key)
Get item from the underlying DataFrame.
- __len__()
Get the length of the underlying DataFrame.
- validate(df: pd.DataFrame)
Validate the DataFrame for required columns and order.
- to_pyranges()
Convert the DataFrame to a PyRanges object.
- set_plotter(plotter: Literal["seaborn", "plotly"])
Set the plotting backend to either seaborn or plotly.
- filter_by_significance(cutoff: float, filter_on: str = "qvalue")
Filter DMR results by significance threshold.
- sort_by_qvalue()
Sort the DMR results by q-value.
- sort_by_pvalue()
Sort the DMR results by p-value.
- to_dmrfile(out_path, include_headers: bool = True)
Save the DMR results to a file.
- from_dmrfile(cls, dmrfile: str, has_headers: bool = True) -> "DMRResult"
Load a DMRResult from a DMR file.
- preview(n=5)
Display a preview of the DMR results.
- plot_volcano(x_col: str = "mod_difference", y_col: str = "dmr_qvalue", significance_threshold: float = 0.05)
Plot a volcano plot using the selected plotting backend.
- plot_manhattan(x_col: str = "Start", y_col: str = "dmr_qvalue", significance_threshold: float = 0.05)
Plot a Manhattan plot using the selected plotting backend.
- filter_dmr_results(cutoffs: dict)#
Filter DMR results by providing cutoffs.
- Parameters:
cutoffs (dict) –
All filters are non-strict. For example, filtering by dmr_pvalue or dmr_qvalue will return all values less than or equal to the provided value. The mod_difference filter is applied to the absolute value, i.e. 0.2 will return all values >= 0.2 as well as all values <= -0.2.
The dictionary takes the format:
{
    "dmr_qvalue": value,
    "dmr_pvalue": value,
    "mod_difference": absolute(value)
}
- Returns:
A new DMRResult where the DMR results have been filtered by the specified cutoff values.
- Return type:
DMRResult
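Example use (illustrative sketch; assumes dmr_result is a DMRResult, e.g. returned by DMRWorkflow.run, and the cutoff values are placeholders):
>>> filtered = dmr_result.filter_dmr_results({"dmr_qvalue": 0.05, "mod_difference": 0.2})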
- classmethod from_dmrfile(dmrfile: str) DMRResult #
Load a DMRResult from a DMR file.
Automatically detects if the file has headers by examining the first non-comment line. Also handles provenance comments (lines starting with #). Supports any compression format that pandas supports.
- Parameters:
dmrfile (str) – Path to the DMR file.
- Returns:
A new DMRResult object containing the data from the file.
- Return type:
DMRResult
- plot_manhattan(x_col: str = 'Start', y_col: str = 'dmr_qvalue', plot_significance_threshold: float = 0.05, **kwargs)#
Plot a Manhattan plot using DMRPlotting.
- Parameters:
x_col (str) – The column name for the x-axis (default is ‘Start’).
y_col (str) – The column name for the y-axis (default is ‘dmr_qvalue’).
plot_significance_threshold (float) – Plot the significance threshold for q-values as a horizontal line (default is 0.05). This does not filter the data.
- plot_volcano(x_col: str = 'mod_difference', y_col: str = 'dmr_qvalue', plot_significance_threshold: float = 0.05, mod_diff_thresholds: tuple[float, float] | None = None, **kwargs)#
Plot a volcano plot using DMRPlotting.
- Parameters:
x_col (str) – The column name for the x-axis (default is ‘mod_difference’).
y_col (str) – The column name for the y-axis (default is ‘dmr_qvalue’).
plot_significance_threshold (float) – Plot the significance threshold for q-values as a horizontal line (default is 0.05). This does not filter the data.
mod_diff_thresholds (tuple[float, float]) – Plot threshold lines for hypo and hyper methylation (e.g., (-0.1, 0.1)).
- class modality.analysis.DMRWorkflow(ds: ContigDataset | Dataset)#
Call DMRs from a ContigDataset.
DMRs are defined as regions with a high variance in methylation levels, and a significant difference in methylation levels between conditions.
Notes
Our approach follows that of Methylkit [1]. The contexts of each region are summed together to get the total number of methylated and unmethylated counts. A DMR is defined as a region with a significant difference in methylation levels between two conditions. This implementation uses a logistic regression to model the methylation levels of each region. Assume that we have \(N\) samples, and for each sample we have a methylation level \(p_i \; (i=1,\ldots,N)\). In addition, we have a categorical group variable \(G\) which is 0 or 1. The group variable is used to define the two conditions that we want to compare. Each sample is assigned to one of the two groups. Finally, we have an optional set of covariates that we want to adjust for.
The full model is defined as:
\[\text{logit}(p_i) = \beta_0 + \beta_G \times G_i + \beta_1 \times covariate_1 + \beta_2 \times covariate_2 + ...\]
The null model is defined as:
\[\text{logit}(p_i) = \beta_0 + \beta_1 \times covariate_1 + \beta_2 \times covariate_2 + ...\]
Overdispersion can be accounted for by using the overdispersion parameter (default is False). If set to True, the model will estimate an overdispersion parameter for each region, based on the approach described in McCullagh and Nelder (1989, chapter 4.5 [2]). This parameter is then used to adjust the standard errors of the model coefficients.
We derive a p-value which tests the null hypothesis that the null model is correct (i.e., \(\beta_G = 0\)), using a Wald test. We can then correct this p-value for multiple testing using the Benjamini-Hochberg method; this gives the dmr_qvalue column in the output.
References
- run(group_name: str, count_array_name: str, total_counts_name: str = 'num_total_c') DMRResult #
Run DMR calling on the dataset.
- Parameters:
group_name – Name of the category of interest in dataset, i.e. the variable to test for differential methylation ("sex", "treatment_group", etc.).
count_array_name – Name of the methylation count array of interest in dataset (usually num_modc, num_mc or num_hmc).
total_counts_name – Name of the total counts array in dataset.
- Returns:
DMRResult – A wrapper around a DataFrame with additional convenience methods. The dataframe is sorted by ascending p-values. The structure of the DMR dataframe is as follows:
- The first three columns are the contig, start, and end of the genomic range or tile.
- num_contexts is the number of contexts in the tile.
- mean_coverage is the mean coverage of the contexts in the tile.
- mean_mod_group_1 is the mean methylation level of group 1 (by default group 1 is the control group).
- mean_mod_group_2 is the mean methylation level of group 2.
- mod_fold_change is the fold change of the mean methylation levels between the two conditions, np.mean(condition_2)/np.mean(condition_1).
- mod_difference is the mean methylation difference between the two conditions, defined as np.mean(condition_2) - np.mean(condition_1).
- test_statistic is the Wald test statistic associated with the p-value.
- dmr_pvalue is the p-value of significance of the difference in methylation levels between the two conditions.
- dmr_qvalue is the FDR-corrected p-value.
- set_condition_order(condition_order: list[str])#
Set the order of conditions for DMR comparisons.
This function ensures a fixed order for condition comparisons:
If condition_order is not specified, conditions are used in the order they appear in the sample sheet.
If condition_order is specified, the first condition is treated as the control, and all subsequent conditions are compared against it. Additionally, pairwise comparisons are performed among the remaining conditions.
Example
If a user specifies:
set_condition_order(["CONTROL", "I", "II", "III"])
The function will generate the following comparisons:
CONTROL vs I
CONTROL vs II
CONTROL vs III
I vs II
I vs III
II vs III
- Parameters:
condition_order (list[str]) – A list defining the order of conditions for comparison.
- Returns:
self – The workflow instance with updated condition order.
- Return type:
DMRWorkflow
- set_covariates(covariates: str | list[str])#
An optional set of covariates that we want to adjust for in the DMR test.
- set_regions(bedfile: str | None = None, window_size: int | None = None, ranges: PyRanges | None = None, presegment_threshold: int = 500, presegment_maxsize: int = 5000, presegment_min_contexts: int = 5)#
Define genomic regions for DMR analysis.
This method sets the regions used for DMR analysis. Specify one of the following:
- A BED file (bedfile)
- A fixed window size (window_size)
- A predefined set of regions as a PyRanges object (ranges)
If none of these are provided, presegmentation parameters can be adjusted instead.
Constraints:
- Only one of bedfile, window_size, or ranges can be set.
- Presegmentation parameters (presegment_threshold, presegment_maxsize, presegment_min_contexts) cannot be used when specifying bedfile, window_size, or ranges.
- Parameters:
bedfile (str | None, optional) – Path to a BED file defining genomic regions.
window_size (int | None, optional) – Fixed window size for segmentation.
ranges (pr.PyRanges | None, optional) – Predefined PyRanges object defining genomic regions.
presegment_threshold (int, optional) – Minimum distance threshold for presegmenting regions (default: 500).
presegment_maxsize (int, optional) – Maximum allowed region size for presegmentation (default: 5000).
presegment_min_contexts (int, optional) – Minimum number of contexts required in a presegmented region (default: 5).
- Returns:
self – The workflow instance with updated region settings.
- Return type:
DMRWorkflow
- Raises:
ValueError – If more than one of bedfile, window_size, or ranges is specified. If presegmentation parameters are set alongside bedfile, window_size, or ranges.
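Example use (end-to-end illustrative sketch of a single-comparison workflow; assumes the dataset carries a per-sample "treatment_group" variable, and the output file name is a placeholder):
>>> from modality.analysis import DMRWorkflow
>>> wf = DMRWorkflow(ds)
>>> wf = wf.set_regions(window_size=1000)     # fixed 1000 bp windows
>>> result = wf.run(group_name="treatment_group", count_array_name="num_modc")
>>> result.to_dmrfile("dmrs.bed")             # save the sorted DMR table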
- modality.analysis.catplot(featureset: Dataset, **kwargs)#
Wrapper around seaborn’s catplot function.
This function specifically handles creating categorical plots.
- modality.analysis.displot(featureset: Dataset, **kwargs)#
Wrapper around seaborn’s displot function.
This function specifically handles creating distribution plots.
- modality.analysis.lmplot(featureset: Dataset, **kwargs)#
Wrapper around seaborn’s regplot function.
This function specifically handles creating regression plots.
- modality.analysis.plot_correlation_heatmaps(cds: ContigDataset, numerator: str, denominator: str, min_coverage: int, title: str | None = None, **kwargs)#
Plot a correlation across samples, as a heatmap of the methylation fractions, for each pair of samples.
- Parameters:
cds – The dataset to be plotted.
var – The variable to be plotted.
min_coverage – The minimum coverage.
**kwargs – The keyword arguments passed to plt.subplots()
- Returns:
fig – The figure object.
axes – The axes object.
- modality.analysis.plot_correlation_scatter(cds: ContigDataset, numerator: str, denominator: str, min_coverage: int, **kwargs)#
Plot the correlation across samples, as a scatter plot of the methylation fractions for each pair of samples.
- Parameters:
cds – The dataset to be plotted.
numerator – the numerator variable
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
**kwargs – The keyword arguments passed to plt.subplots().
- Returns:
The figure and axes objects.
- modality.analysis.plot_coverage_distribution(cds: ContigDataset, var='num_total_c', bins=None, norm_hist=False, num_max_tick_labels=10, return_data=False, **kwargs)#
Plot the distribution of coverage across samples.
- Parameters:
cds – The dataset to be plotted.
var – the variable to plot. Most likely one of [“num_total_c”].
bins – the number of bins or an array of bins. IMPORTANT: the auto-binning may not return an array of bin edges with the desired number of bins. If you want full control over number of bins, you should pass an array of bin edges.
norm_hist – whether to normalise the histogram
num_max_tick_labels – the maximum number of tick labels on the x-axis.
return_data – whether to return the data used to plot the histogram
**kwargs – the keyword arguments passed to sns.lineplot()
- modality.analysis.plot_histogram(cds: ContigDataset, var='frac_mc', bins=26, norm_hist=False, num_max_tick_labels=7, return_data=False, **kwargs)#
Plot a histogram from the data in the ContigDataset.
- Parameters:
cds – The dataset to be plotted.
var – the variable to plot
bins – the number of bins or an array of bins. IMPORTANT: the auto-binning may not return an array of bin edges with the desired number of bins. If you want full control over number of bins, you should pass an array of bin edges.
norm_hist – whether to normalise the histogram
num_max_tick_labels – the maximum number of tick labels on the x-axis.
return_data – whether to return the data used to plot the histogram
**kwargs – the keyword arguments passed to sns.catplot()
- Returns:
the catplot object
- Return type:
g
- modality.analysis.plot_methylation_tile(cds: ContigDataset, numerator: str, denominator: str, min_coverage: int, sample_label: str = 'sample_id', position_covariate: None | CovariateObject = None, title: str | None = 'auto', **kwargs)#
Plot a methylation tile plot. This plot is useful for displaying structure present in the cohort. It is intended for small regions of the genome, as it will load all the data into memory. Such a small region can be accessed by subsetting the dataset, e.g. cds[f"{chr}:{start}-{end}"], or by calling subset_byranges with a single range.
- Parameters:
cds – The dataset to be plotted.
numerator – the numerator variable
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
sample_label – the label to be used for the sample if not ‘sample_id’
position_covariate – the position covariate to be used
title – the title of the plot
**kwargs – the keyword arguments passed to plt.imshow()
- modality.analysis.plot_methylation_trace(cds: ContigDataset, numerators: list[str], denominator: str, min_coverage: int, group: str = 'sample_id', rolling_window_size: int = 100, title: str | None = None, facet_kws: dict | None = None, **kwargs)#
Plot a methylation fraction trace across a genomic region. This function is intended for use on a small region of the genome, as it will load all the data into memory. Such a small region can be accessed by subsetting the dataset, e.g. cds[f"{chr}:{start}-{end}"], or by calling subset_byranges with a single range.
- Parameters:
cds – The dataset to be plotted.
numerators – List of numerator variables; can be one or several. A single comma-separated string is also accepted.
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
group – the group to be plotted
rolling_window_size – the size of the rolling window for smoothing
title – the title of the plot
facet_kws – the keyword arguments passed to sns.relplot() as facet_kws to the FacetGrid, e.g. sharey. If None, defaults to {'sharey': False}.
**kwargs – The keyword arguments passed to sns.relplot()
- Returns:
The figure object.
- modality.analysis.plot_pca_scatter(pca_object, pca_result, xaxis_pc: int = 1, yaxis_pc: int = 2, **kwargs)#
Generates a PCA scatter plot - uses seaborn.relplot
- Parameters:
pca_object (sklearn.decomposition.PCA|dask_ml.decomposition.PCA) – returned pca object from run_pca
pca_result (np.ndarray|da.Array) – Result of PCA transformation from run_pca
xaxis_pc (int) – Desired PC from pca_result to plot on the x-axis
yaxis_pc (int) – Desired PC from pca_result to plot on the y-axis
**kwargs – Additional arguments to be passed to seaborn’s relplot
- Returns:
The figure object
- modality.analysis.plot_pca_scree(pca_object, **kwargs)#
Generates a PCA scree plot - uses seaborn.catplot
- Parameters:
pca_object (sklearn.decomposition.PCA|dask_ml.decomposition.PCA) – returned pca object from run_pca
**kwargs – Additional arguments to be passed to seaborn’s catplot
- Returns:
The figure object
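Example (a minimal sketch, assuming pca_object was returned by run_pca, documented below):

from modality.analysis import plot_pca_scree

# pca_object: output of run_pca (assumed to exist).
g = plot_pca_scree(pca_object)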
- modality.analysis.plot_pearson_matrix(cds: ContigDataset, numerator: str, denominator: str, min_coverage: int, **kwargs)#
Plot a Pearson correlation matrix. This function calculates the Pearson correlation coefficient between the methylation fractions across samples, using the compute_pearson_matrix function, and plots the matrix as a heatmap.
- Parameters:
cds – The dataset to be plotted.
numerator – the numerator variable
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
**kwargs – the keyword arguments passed to plt.subplots(). For instance, vmin and vmax can be used to set the minimum and maximum values of the colorbar, and labelsize sets the size of the x- and y-axis labels.
- Returns:
The figure and axes objects.
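Example (a minimal sketch; cds and the column names are illustrative assumptions):

from modality.analysis import plot_pearson_matrix

# cds: an existing ContigDataset (assumed); columns and thresholds are illustrative.
fig, ax = plot_pearson_matrix(cds, numerator="num_mc", denominator="num_total_c", min_coverage=10, vmin=0, vmax=1)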
- modality.analysis.plot_six_base_ternary(cds: ContigDataset, denominator: str = 'num_total', min_coverage: int = 10, cmap=None, gridsize: int = 50, figsize=None, include_colorbar: bool = True, **kwargs)#
Creates a ternary plot of 6-base data, showing the distribution of C, 5mC, and 5hmC fractions across samples.
- Parameters:
cds – The dataset to be plotted.
denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively
min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation
cmap – the colormap to use
gridsize – the number of hexagons in the x-direction
figsize – the size of the figure
include_colorbar – whether to include a colorbar showing the log-transformed counts of contexts in each bin.
**kwargs – the keyword arguments passed to plt.figure()
- Returns:
The figure and axes objects.
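Example (a minimal sketch; cds is assumed to be an existing six-base ContigDataset):

from modality.analysis import plot_six_base_ternary

# cds: an existing six-base ContigDataset (assumed); the threshold is illustrative.
fig, ax = plot_six_base_ternary(cds, denominator="num_total", min_coverage=10, gridsize=50)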
- modality.analysis.relplot(featureset: Dataset, **kwargs)#
Wrapper around seaborn’s relplot function.
This function specifically handles creating relational plots.
- modality.analysis.run_pca(input_array: DataArray, n_components: int = 2, svd_solver: str = 'randomized', nan_handling_strategy: str = 'drop', **kwargs)#
Run PCA on a 2D array of data.
- Parameters:
input_array (xr.DataArray) – The input array to run PCA on. The array should be 2D with sample_ids and contexts/ranges in the rows and columns respectively. If the array is not in this format, it will be transposed.
n_components (int) – The number of principal components to return.
svd_solver (str) – The solver to use for the PCA. Options are ‘auto’, ‘full’, ‘arpack’, ‘randomized’. See sklearn.decomposition.PCA for more details.
nan_handling_strategy (str) – The strategy to use for handling NaN values in the input array. Options are ‘drop’, ‘set_to_zero’ or ‘mean_value_imputation’. When using ‘mean_value_imputation’, columns that contain majority NaN values will be removed before imputation.
**kwargs – Additional keyword arguments to pass to the PCA class.
- Returns:
A tuple (pca_object, transformed_data), where pca_object (sklearn.decomposition.PCA | dask_ml.decomposition.PCA) is the PCA object fit to input_array, and transformed_data (np.ndarray | da.Array) is the transformed input_array after dimensionality reduction.
- Return type:
tuple
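Example (a minimal sketch using synthetic input; in practice the 2D DataArray would typically be derived from a ContigDataset, with sample_ids in the rows):

import numpy as np
import xarray as xr
from modality.analysis import run_pca

# Synthetic 4-sample by 100-feature matrix; dimension names are illustrative.
arr = xr.DataArray(np.random.rand(4, 100), dims=("sample_id", "feature"))
pca_object, transformed = run_pca(arr, n_components=2, nan_handling_strategy="drop")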
modality.annotation module#
- modality.annotation.get_annotation(reference: str | None = None, contig: str | None = None, start: int | None = None, end: int | None = None, annotation_type: str | None = None, filterby: dict | None = None, exact_match: bool = False, as_pyranges: bool = True, gff3_file: str | None = None)#
Generic function for fetching annotation from GENCODE. All get_* functions should call this function.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
annotation_type – Annotation type.
filterby – Dictionary of filters to apply.
exact_match – Whether to apply exact matching to filters.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
gff3_file – optional path to a local GFF3 annotation file to read from instead of the default GENCODE file.
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the requested annotation
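Example (a minimal sketch; the annotation_type value and the filterby key follow GENCODE GFF3 conventions and are illustrative assumptions, as are the coordinates):

from modality.annotation import get_annotation

# "gene" and "gene_type" are assumed GENCODE GFF3 field values; adjust to your annotation.
genes = get_annotation(reference="hg38", contig="chr1", start=1_000_000, end=2_000_000, annotation_type="gene", filterby={"gene_type": "protein_coding"})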
- modality.annotation.get_cds(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
Retrieve the coding sequences of protein-coding genes for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the coding sequences
- modality.annotation.get_cpg_islands(reference: str = 'hg38', contigs: list | None = None, as_pyranges: bool = True)#
Retrieve the CpG islands in contigs.
- Parameters:
reference – Reference name.
contigs – List of contigs for which to return CpG islands.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the CpG islands
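Example (a minimal sketch; the contig names are illustrative):

from modality.annotation import get_cpg_islands

islands = get_cpg_islands(reference="hg38", contigs=["chr1", "chr2"])
islands_df = get_cpg_islands(reference="hg38", contigs=["chr1"], as_pyranges=False)  # pandas DataFrame instead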
- modality.annotation.get_cpg_shelves(reference: str = 'hg38', contigs: list | None = None)#
Retrieve CpG shelves for given contigs as a PyRanges instance. Shelves: [start - 4000, start - 2000] and [end + 2000, end + 4000].
- Parameters:
reference – Reference name.
contigs – List of contigs for which to return CpG shelves.
- Returns:
(pyranges.PyRanges) PyRanges containing the CpG shelves
- modality.annotation.get_cpg_shores(reference: str = 'hg38', contigs: list | None = None)#
Retrieve CpG shores for given contigs as a PyRanges instance. Shores: [start - 2000, start] and [end, end + 2000].
- Parameters:
reference – Reference name.
contigs – List of contigs for which to return CpG shores.
- Returns:
(pyranges.PyRanges) PyRanges containing the CpG shores
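Example (a minimal sketch covering both get_cpg_shelves and get_cpg_shores; the contig name is illustrative):

from modality.annotation import get_cpg_shelves, get_cpg_shores

shelves = get_cpg_shelves(reference="hg38", contigs=["chr1"])
shores = get_cpg_shores(reference="hg38", contigs=["chr1"])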
- modality.annotation.get_exons(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
Retrieve exons of protein-coding genes for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the exons
- modality.annotation.get_feature_dist(contig: str | None = None, ref_pos: DataArray | ndarray | list | None = None, strand: str = 'both', return_nearest: bool = True, features: PyRanges = None) tuple[ndarray, ndarray] | ndarray #
Retrieve the distance to the nearest genomic feature for a given range.
- Parameters:
contig – Contig name. Default is None, which returns distances for all contigs.
ref_pos – Reference positions as an xarray DataArray, NumPy array, or list.
strand – Strand of the feature. If “both”, returns the nearest feature regardless of strand. If “+” or “-”, returns the feature on that strand.
return_nearest – Whether to return the nearest feature.
features – Required PyRanges object containing the feature coordinates.
- Returns:
A tuple of NumPy arrays of nearest feature distances and positions if return_nearest is True, otherwise a NumPy array of feature distances.
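Example (a minimal sketch; the positions are illustrative, and the required features PyRanges is built here with get_genes):

import numpy as np
from modality.annotation import get_feature_dist, get_genes

# Positions are illustrative; any PyRanges of feature coordinates can be passed as features.
features = get_genes(reference="hg38", contig="chr1")
dists, nearest = get_feature_dist(contig="chr1", ref_pos=np.array([100_000, 250_000]), strand="both", features=features)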
- modality.annotation.get_five_prime_utrs(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
Retrieve the 5’ UTRs of protein-coding genes for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the 5’ UTRs
- modality.annotation.get_gencode_url(reference: str) str #
Generate the URL of the GENCODE annotation file for a given reference (release mapping last updated Spring 2023).
See https://www.gencodegenes.org/human/releases.html and https://www.gencodegenes.org/mouse/releases.html.
- Parameters:
reference – reference name
- Returns:
GENCODE URL
- modality.annotation.get_genes(reference: str | None = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, as_pyranges: bool = True, gff3_file: str | None = None)#
Fetch gene annotation ranges from GENCODE within a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
protein_coding – Whether to return only protein-coding genes.
filterby – Dictionary of filters to apply.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the genes
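Example (a minimal sketch; the coordinates are illustrative):

from modality.annotation import get_genes

genes = get_genes(reference="hg38", contig="chr1", start=1_000_000, end=2_000_000)
genes_df = get_genes(reference="hg38", contig="chr1", protein_coding=True, as_pyranges=False)  # pandas DataFrame instead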
- modality.annotation.get_introns(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, as_pyranges: bool = True, strand: str | None = None, transcripts: list | None = None, nb_workers: int | None = None)#
Retrieve the introns of protein-coding genes for a given range. The implementation is based on the following:
– GENCODE annotation has many transcripts per gene.
– Each transcript has several exons, and usually one 5’ UTR and one 3’ UTR.
– For each transcript_id, all the exons and 5’ and 3’ UTRs are collected and merged into a PyRanges object.
– This PyRanges object carries the gene_id of the gene associated with the transcript_id, so the relevant gene can be retrieved.
– With the gene’s start and end positions and the merged PyRanges object, the PyRanges subtract method yields the introns as genes - (exons + 5’ + 3’ UTRs).
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
protein_coding – Whether to return only protein-coding genes.
filterby – Dictionary of filters to apply.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
strand – Strand.
exact_match – Whether to apply exact matching to filters.
transcripts – List of transcripts to consider. This is optional, but will make the function faster if you know the transcripts you are interested in.
nb_workers – Number of workers to use for parallel processing. Defaults to None, which uses all available cores.
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the introns.
Note: this function is slow, and we recommend preselecting the transcripts you are interested in.
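Example (a minimal sketch; transcripts are preselected as recommended above, the transcript_id column name is an assumption based on GENCODE attribute naming, and the coordinates are illustrative):

from modality.annotation import get_introns, get_transcripts

# Preselect transcripts in the region of interest to speed up intron retrieval.
tx = get_transcripts(reference="hg38", contig="chr21", start=30_000_000, end=31_000_000, as_pyranges=False)
introns = get_introns(reference="hg38", contig="chr21", start=30_000_000, end=31_000_000, transcripts=list(tx["transcript_id"]))  # column name assumed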
- modality.annotation.get_lncRNAs(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
Retrieve the long non-coding RNA genes for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the lncRNA genes
- modality.annotation.get_miRNAs(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
Retrieve the microRNA genes for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the miRNA genes
- modality.annotation.get_promoters(reference: str | None = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, as_pyranges: bool = True, gff3_file: str | None = None)#
Retrieve the promoter ranges of protein-coding genes, defined as the 1 kb region upstream of the TSS.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
protein_coding – Whether to return only protein-coding genes.
filterby – Dictionary of filters to apply.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the promoter ranges
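Example (a minimal sketch):

from modality.annotation import get_promoters

promoters = get_promoters(reference="hg38", contig="chr1", protein_coding=True)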
- modality.annotation.get_protocadherin_cluster(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
Retrieve the protocadherin gene cluster ranges for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the protocadherin clusters
- modality.annotation.get_start_codons(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
Retrieve the start codons of protein-coding genes for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the start codons
- modality.annotation.get_three_prime_utrs(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
Retrieve the 3’ UTRs of protein-coding genes for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the 3’ UTRs
- modality.annotation.get_transcription_end_region(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, start_offset: int = 0, span: int = 0, as_pyranges: bool = True)#
Fetch transcription end regions of protein-coding genes from GENCODE (within a given range).
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
protein_coding – Whether to return only protein-coding genes.
filterby – Dictionary of filters to apply.
start_offset – Offset from transcription end site.
span – Span of transcription end site region.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the transcription end regions
- modality.annotation.get_transcripts(reference: str | None = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#
For a given range, return the transcripts of protein-coding genes.
- Parameters:
reference – Reference name.
contig – Contig name. If None, all contigs are returned.
start – Start position. If None, all positions are returned.
end – End position. If None, all positions are returned.
strand – Strand.
as_pyranges – whether to return a PyRanges instead of a pandas DataFrame
- Returns:
(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the transcripts
- modality.annotation.get_tss_dist(reference: str, contig: str | None = None, ref_pos: DataArray | ndarray | list | None = None, strand: str = 'both', return_nearest: bool = True) tuple[ndarray, ndarray] | ndarray #
Retrieve the distance to the TSS for a given range.
- Parameters:
reference – Reference name.
contig – Contig name. The default is None, which will return the TSS distance for all contigs.
ref_pos – Reference positions as an xarray DataArray, NumPy array, or list.
strand – Strand. If “both”, return the nearest TSS, regardless of strand. If “+” or “-”, return the TSS on that strand.
return_nearest – Whether to return the nearest TSS.
- Returns:
A tuple of NumPy arrays of nearest TSS distances and positions if return_nearest is True, otherwise a NumPy array of TSS distances.
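Example (a minimal sketch; the positions are illustrative):

import numpy as np
from modality.annotation import get_tss_dist

# Positions are illustrative reference coordinates on chr1.
dists, nearest = get_tss_dist(reference="hg38", contig="chr1", ref_pos=np.array([100_000, 250_000]), strand="both")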