API reference#

modality.annotation module#

modality.annotation.conditional_dataframe_to_pyranges(dataframe: DataFrame, convert: bool, to_drop: list | None = None)#

Convert dataframe to PyRanges if convert is True, optionally dropping the to_drop columns.

Parameters:
  • dataframe – dataframe to convert

  • convert – whether to execute the conversion

  • to_drop – columns to drop

Returns:

(Union[pd.DataFrame, pr.PyRanges]) the original dataframe or generated PyRanges

modality.annotation.dataframe_to_pyranges(df: DataFrame, mapping: dict | None = None)#

Convert a dataframe to a pyranges instance.

Parameters:
  • df – dataframe to convert

  • mapping – Mapping of annotation standard to pyranges standard.

Returns:

(pyranges.PyRanges) generated Pyranges instance.

modality.annotation.get_annotation(reference: str | None, contig: str | None = None, start: int | None = None, end: int | None = None, annotation_type: str | None = None, filterby: dict | None = None, exact_match: bool = False, as_pyranges: bool = True)#

Generic function for fetching annotation from GENCODE. All get_* functions should call this function.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • annotation_type – Annotation type.

  • filterby – Dictionary of filters to apply.

  • exact_match – Whether to apply exact matching to filters.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the requested annotation

modality.annotation.get_cds(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

Retrieve the coding sequences of protein-coding genes for a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the coding sequences

modality.annotation.get_cpg_islands(reference: str = 'hg38', contigs: list | None = None, as_pyranges: bool = True)#

Retrieve the CpG islands in contigs.

Parameters:
  • reference – Reference name.

  • contigs – List of contigs for which to return CpG islands.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the CpG islands

modality.annotation.get_cpg_shelves(reference: str = 'hg38', contigs: list | None = None)#

Retrieve CpG shelves for given contigs as a PyRanges instance. Shelves: [-4000 + start, -2000 + start] + [end + 2000, end + 4000].

Parameters:
  • reference – Reference name.

  • contigs – List of contigs for which to return CpG shelves.

Returns:

(pyranges.PyRanges) PyRanges containing the CpG shelves

modality.annotation.get_cpg_shores(reference: str = 'hg38', contigs: list | None = None)#

Retrieve CpG shores for given contigs as a PyRanges instance. Shores: [-2000 + start, start] + [end, end + 2000].

Parameters:
  • reference – Reference name.

  • contigs – List of contigs for which to return CpG shores.

Returns:

(pyranges.PyRanges) PyRanges containing the CpG shores

modality.annotation.get_exons(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

Retrieve exons of protein-coding genes for a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the exons

modality.annotation.get_five_prime_utrs(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

Retrieve the 5’ UTRs of protein-coding genes for a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the 5’ UTRs

modality.annotation.get_gencode_url(reference: str) str#

Generate the URL for the GENCODE annotation file for a given reference, updated Spring 2023.

https://www.gencodegenes.org/human/releases.html https://www.gencodegenes.org/mouse/releases.html

Parameters:

reference – reference name

Returns:

GENCODE URL

modality.annotation.get_genes(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, as_pyranges: bool = True)#

Fetch gene annotation ranges from GENCODE within a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • protein_coding – Whether to return only protein-coding genes.

  • filterby – Dictionary of filters to apply.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the genes

modality.annotation.get_introns(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, as_pyranges: bool = True, strand: str | None = None, transcripts: list | None = None, nb_workers: int | None = None)#

Retrieve the introns of protein-coding genes for a given range. The implementation is based on the following: - GENCODE annotation has many transcripts per gene - For each transcript there are several exons, and usually one 5’UTR and one 3’UTR. - For each transcript_id, grab all the exons, 5’ and 3’ UTRs and merge them into a pyranges object. - This pyranges object will contain a gene_id for the gene associated with the transcript_id. So we can grab the relevant gene. - Now we have the start and end position of the gene, and the pyranges object, so we can use the subtract method from pyranges to get the introns as genes - (exons + 5’ + 3’ UTRs).

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • protein_coding – Whether to return only protein-coding genes.

  • filterby – Dictionary of filters to apply.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

  • strand – Strand.

  • exact_match – Whether to apply exact matching to filters.

  • transcripts – List of transcripts to consider. This is optional, but will make the function faster if you know the transcripts you are interested in.

  • nb_workers – Number of workers to use for parallel processing. Default to None to use all available cores.

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the introns.

Note: this function is slow, and we recommend preselecting the transcripts you are interested in.

modality.annotation.get_lncRNAs(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

Retrieve the long non-coding RNA genes for a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the lncRNA genes

modality.annotation.get_miRNAs(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

Retrieve the microRNA genes for a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the miRNA genes

modality.annotation.get_promoters(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, as_pyranges: bool = True)#

Retrieve the promoter ranges of protein coding genes - defined as 1KB upstream of the TSS.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • protein_coding – Whether to return only protein-coding genes.

  • filterby – Dictionary of filters to apply.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the promoter ranges

modality.annotation.get_protein_coding_def() dict#
modality.annotation.get_protocadherin_cluster(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

See e.g. https://genome.ucsc.edu/cgi-bin/hgGene?db=hg38&hgg_chrom=chr5&hgg_gene=ENST00000194152.4&hgg_start=141121817&hgg_end=141125622&hgg_type=knownGene

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the protocadherin clusters

modality.annotation.get_start_codons(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

Retrieve the start codons of protein-coding genes for a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the start codons

modality.annotation.get_three_prime_utrs(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

Retrieve the 3’ UTRs of protein-coding genes for a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the 3’ UTRs

modality.annotation.get_transcription_end_region(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, start_offset: int = 0, span: int = 0, as_pyranges: bool = True)#

Fetch protein-coding transcription END sites from GENCODE (within a given range).

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • protein_coding – Whether to return only protein-coding genes.

  • filterby – Dictionary of filters to apply.

  • start_offset – Offset from transcription end site.

  • span – Span of transcription end site region.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the transcription end regions

modality.annotation.get_transcripts(reference: str | None = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, strand: str | None = None, as_pyranges: bool = True)#

For a given range, return the transcripts of protein-coding genes.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • strand – Strand.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the transcripts

modality.annotation.get_tss_dist(reference: str, contig: str | None = None, ref_pos: ndarray = None, strand: str = 'both', return_nearest: bool = True) tuple[ndarray, ndarray] | ndarray#

Retrieve the distance to the TSS for a given range.

Parameters:
  • reference – Reference name.

  • contig – Contig name. The default is None, which will return the TSS distance for all contigs.

  • ref_pos – Reference position.

  • strand – Strand. If “both”, return the nearest TSS, regardless of strand. If “+” or “-”, return the TSS on that strand.

  • return_nearest – Whether to return the nearest TSS.

Returns:

tuple with NumPy arrays of nearest and TSS distances or a NumPy array of TSS distances

modality.annotation.get_tss_region(reference: str = 'hg38', contig: str | None = None, start: int | None = None, end: int | None = None, protein_coding: bool = True, filterby: dict | None = None, start_offset: int = 0, span: int = 0, as_pyranges: bool = True)#

Fetch gene transcription START site regions from GENCODE.

Parameters:
  • reference – Reference name.

  • contig – Contig name. If None, all contigs are returned.

  • start – Start position. If None, all positions are returned.

  • end – End position. If None, all positions are returned.

  • protein_coding – Whether to return only protein-coding genes.

  • filterby – Dictionary of filters to apply.

  • start_offset – Offset from TSS.

  • span – Span of TSS region.

  • as_pyranges – whether to return a PyRanges instead of a pandas DataFrame

Returns:

(Union[pr.PyRanges, pd.DataFrame]) PyRanges or pandas DataFrame containing the transcription start site regions

modality.annotation.get_ucsc_df(rest_url: str, contigs: list | None = None) DataFrame#

Parse the json returned by a UCSC REST API query to a dataframe.

Parameters:
  • rest_url – URL string for the REST API query.

  • contigs – List of contigs to return.

Returns:

dataframe of CpG islands

modality.annotation.get_ucsc_rest_url(data_type: str = 'track', reference: str = 'hg38', key_values: str = 'track=cpgIslandExt') str#

Return the URL string necessary to query the UCSC REST API.

Parameters:
  • data_type – Data type to query.

  • reference – Reference name.

  • key_values – Key-value pairs to pass to the REST API.

Returns:

URL string for the REST API query.

modality.annotation.pyranges_columns_mapping() dict#

Common conversion from annotation standard to pyranges standard.

Returns:

Mapping of annotation standard to pyranges standard.

modality.annotation.pyranges_to_dataframe(dp, mapping: dict | None = None) DataFrame#

Cast from a pyranges instance to a dataframe. Optionally, provide a mapping to rename columns by.

Parameters:
  • dp (pyranges.PyRanges) – Pyranges instance.

  • mapping – Mapping of annotation standard to pyranges standard.

Returns:

generated DataFrame

modality.annotation.standardise_contig(contig: str, reference: str) str#

Conversion of named contig keys from biomodal standard to those used by GENCODE and UCSC.

Parameters:
  • contig – Contig name.

  • reference – Reference name.

Returns:

Equivalent contig name.

modality.contig_dataset module#

class modality.contig_dataset.ContigDataset(ds: Dataset | None = None, data_vars: dict | None = None, coords: dict | None = None, attrs: dict | None = None)#

Bases: MutableSequence, ExporterMixin, AggregatorMixin

Methylation context-specific xarray-like dataset.

A wrapper around an xarray.Dataset instance that extends its functionality.

aggregate_bygroup(group: str | List[str], dim: str = 'sample_id', inplace: bool = False)#

Aggregate (sum) data variables across a pre-defined group. This function is to be used to aggregate across biological replicates or similar. For example:

meta['RepGroup'] = np.random.choice(['LabA', 'LabB'], len(meta))
meta['age'] = np.random.randint(1, 100, len(meta))
ds = ds.merge_meta(meta.reset_index(names='sample_id'))
grouped = ds.aggregate_bygroup('RepGroup')

Currently, this function only supports aggregation using sum(); in some cases it may be more appropriate to use the mean, e.g. for age and other metadata. In this case, the mean can be calculated after aggregation using the group_nmember key.

Parameters:
  • group (str or list) – The group(s) to aggregate by. This can be a single group or a list of groups.

  • dim (str) – The dimension to aggregate over. This is usually “sample_id” but can be changed if required.

  • inplace (bool) – Whether to modify the dataset in place or return a new dataset.

Returns:

The modified ContigDataset.

Return type:

self (contig_dataset.ContigDataset)

Note

This function will remove the “sample_id” key. This is used by several other internal functions, the user may want to restore the sample ID key using the group key, e.g. ds = ds.assign_coords(sample_id = (“group”, ds[“my_group”].data)).

assign_fractions(numerators: str | list, denominator: str, min_coverage: int, inplace: bool = False)#

Add columns to the ContigDataset with the fractions of each of the numerators columns to denominator, named after the pattern “frac_[numerator column]”.

The parameters are stored in the ContigDataset attributes as “frac_denominator” and “frac_min_coverage”.

Parameters:
  • numerators – the columns of which fractions to the denominator to be calculated e.g. “num_mc”, “num_modc”

  • denominator – the column to be used as the denominator of the fractions, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively

  • min_coverage – the minimum coverage of the fraction denominator required for a position to be included in the fraction calculation. positions with coverage bellow min_coverage are set to np.nan

  • inplace – boolean of whether to modify the object in place

Returns:

copy of self with added fraction columns or None if self is modified inplace

assign_text_nucleotide_contexts(trinucleotide_context_column)#
at_mincoverage(minimum=20)#

Select a subset of positions from the dataset that meet a minimum coverage threshold.

Parameters:

minimum (int or float) – the minimum coverage threshold.

Returns:

the ContigDataset instance with the selected positions.

Return type:

self

compute_correlation_matrix(numerator: str, denominator: str, min_coverage: int)#

Compute a correlation matrix across samples.

Parameters:
  • numerator – the numerator variable

  • denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively

  • min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation

Returns:

The correlation matrix.

compute_histogram(var='frac_mc', bins=26, norm_hist=False)#

Compute a histogram from the data in the ContigDataset. This function assumes that the data contains a sample_id dimension. It is required that “sample_id” is present as “hue”, “row” or “col” in the kwargs. If not set, then by default the “sample_id” will be used as the “hue”. If hue is already set, then the code will raise a ValueError.

Additional categorical covariates that are present in the sample_id dimension can be used as “hue”, “row” or “col” in the histogram.

Parameters:
  • var – the variable to plot

  • bins – the number of bins or an array of bins. IMPORTANT: the auto-binning may not return an array of bin edges with the desired number of bins. If you want full control over number of bins, you should pass an array of bin edges.

  • norm_hist – whether to normalise the histogram

Returns:

the count array bins: the bins

Return type:

count_array

compute_pearson_matrix(numerator: str, denominator: str, min_coverage: int, groupby: str = 'group')#

Compute a Pearson correlation matrix. This function calculates the Pearson correlation coefficient between the methylation fractions across samples.

Parameters:
  • numerator – the numerator variable

  • denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively

  • min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation

  • groupby – label by which to aggregate the ContigDataset

  • **kwargs – the keyword arguments passed to plt.subplots()

Returns:

The matrix of Pearson correlation coefficients.

ds#
classmethod from_dict(vars_dict: dict)#
classmethod from_run_configs(run_config_list, merge_cpgs=False, merge_meta=True, merge_annotation=False)#
classmethod from_zarrz(fpath: str, **kwargs) ContigDataset#

Load the zipped zarr store or zarr store directory in fpath as a ContigDataset.

See: https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html

Parameters:
  • fpath – the path to the zarr store

  • kwargs – the keyword arguments to be passed to xr.open_zarr

Returns:

the loaded ContigDataset

get_temporary_name()#
granges2rows(ranges)#

Return equivalent genomic ranges encoded as DataArray row indices.

Parameters:

ranges (PyRanges) – the ranges, as a PyRanges object.

Returns:

the equivalent genomic ranges encoded as DataArray row indices.

Return type:

result (np.array)

hstack(rest: ~modality.contig_dataset.ContigDataset | list[~modality.contig_dataset.ContigDataset], combine_attrs: ~typing.Callable = <function combine_dataset_attrs>, **kwargs) ContigDataset#

Stack ContigDatasets by sample_id.

See also: https://docs.xarray.dev/en/stable/generated/xarray.concat.html

Parameters:
  • rest – the ContigDataset(s) to be stacked with self

  • combine_attrs – the function to be used to combine the attributes

  • **kwargs – the keyword arguments to be passed to xr.concat

Returns:

the generated stacked ContigDataset

insert()#

S.insert(index, value) – insert value before index

items()#
merge_annotation(annotation, pos='ref_position')#

Merge the position-based annotation data with the ContigDataset.

Parameters:
  • annotation (str) – the path to the annotation data.

  • pos (str) – the position column to be used for merging.

Returns:

the ContigDataset instance with the annotation data merged.

Return type:

self (contig_dataset.ContigDataset)

merge_meta(metadata: DataFrame | str | Path, coord_columns=None, **kwargs)#

Add additional coordinates to the dataset based on per-sample metadata. A sample_id column is required

Parameters:
  • metadata – Pandas DataFrame containing the metadata or path to the metadata file.

  • coord_columns (list) – the columns to be merged.

  • kwargs – additional keyword arguments to be passed to the read_csv function.

Returns:

the ContigDataset instance with the additional coordinates.

Return type:

self (contig_dataset.ContigDataset)

plot_correlation_heatmaps(numerator: str, denominator: str, min_coverage: int, title: str | None = None, **kwargs)#

Plot a correlation across samples, as a heatmap of the methylation fractions, for each pair of samples.

Parameters:
  • ds – The dataset to be plotted.

  • var – The variable to be plotted.

  • min_coverage – The minimum coverage.

  • **kwargs – The keyword arguments passed to plt.subplots() or plt.pcolormesh(). These include: - figsize: tuple, size of the figure (width, height) in inches - cmap: str, the colormap - vmin: float, the minimum value of the colormap - vmax: float, the maximum value of the colormap

Returns:

The figure object. axes: The axes object.

Return type:

fig

plot_correlation_scatter(numerator: str, denominator: str, min_coverage: int, **kwargs)#

Plot the correlation across samples, as a scatter plot of the methylation fractions for each pair of samples.

Parameters:
  • numerator – the numerator variable

  • denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively

  • min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation

  • **kwargs – Keyword arguments for customizing the plot. These include: Subplots arguments: - figsize: tuple, size of the figure (width, height) in inches - sharex, sharey: bool or {‘none’, ‘all’, ‘row’, ‘col’}, share x or y axes among subplots Scatter plot arguments: - s: float or array-like, size of the markers - alpha: float, transparency of the markers Any other keyword arguments accepted by plt.subplots() or plt.scatter()

Returns:

The figure and axes objects.

plot_coverage_distribution(var='num_total_c', bins=None, norm_hist=False, num_max_tick_labels=10, return_data=False, **kwargs)#

Plot the distribution of coverage across samples.

Parameters:
  • var – the variable to plot. Most likely one of [“num_total_c”].

  • bins – the number of bins or an array of bins. IMPORTANT: the auto-binning may not return an array of bin edges with the desired number of bins. If you want full control over number of bins, you should pass an array of bin edges.

  • norm_hist – whether to normalise the histogram

  • num_max_tick_labels – the maximum number of tick labels on the x-axis.

  • return_data – whether to return the data used to plot the histogram

  • **kwargs – the keyword arguments passed to sns.lineplot().

plot_histogram(var='frac_mc', bins=26, norm_hist=False, num_max_tick_labels=7, return_data=False, **kwargs)#

Plot a histogram from the data in the ContigDataset.

Parameters:
  • var – the variable to plot

  • bins – the number of bins or an array of bins. IMPORTANT: the auto-binning may not return an array of bin edges with the desired number of bins. If you want full control over number of bins, you should pass an array of bin edges.

  • norm_hist – whether to normalise the histogram

  • num_max_tick_labels – the maximum number of tick labels on the x-axis.

  • return_data – whether to return the data used to plot the histogram

  • **kwargs – the keyword arguments passed to sns.catplot()

Returns:

the catplot object

Return type:

g

plot_methylation_tile(numerator: str, denominator: str, min_coverage: int, sample_label: str = 'sample_id', position_covariate: None | CovariateObject = None, title: str | None = 'auto', **kwargs)#

Plot a methylation tile plot. This plot is useful for displaying structure present in the cohort. This plot is intended for small regions of the genome, as it will load all the data into memory. One can access such small region by subsetting the dataset, e.g., doing ds[f”{chr}:{start}-{end}”] or calling subset_byranges with a single range.

Parameters:
  • numerator – the numerator variable

  • denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively

  • min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation

  • sample_label – the label to be used for the sample if not ‘sample_id’

  • position_covariate – the position covariate to be used

  • title – the title of the plot

  • **kwargs – the keyword arguments passed to plt.imshow()

plot_methylation_trace(numerators: List[str], denominator: str, min_coverage: int, group: str = 'group', rolling_window_size: int = 100, title: str | None = None, facet_kws: dict | None = None, **kwargs)#

Plot methylation fraction trace across a genomic region. This function is intended to use on a small region of the genome, as it will load all the data into memory. One can access such small region by subsetting the dataset, e.g., doing ds[f”{chr}:{start}-{end}”] or calling subset_byranges with a single range.

Parameters:
  • numerators – List of numerator variables, can be single or multiple. Also accepts a single str comma separated.

  • denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively

  • min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation

  • group – the group to be plotted

  • rolling_window_size – the size of the rolling window for smoothing

  • title – the title of the plot

  • facet_kws – the keyword arguments passed to sns.relplot() as facet_kws to the FacetGrid, e.g sharey. If None, defaults to {‘sharey’: False}.

  • **kwargs – The keyword arguments passed to sns.relplot().

  • 'group' (Note that 'hue' is set to)

  • 'line'. (and kind is set to)

Returns:

The figure object.

plot_pearson_matrix(numerator: str, denominator: str, min_coverage: int, groupby: str = 'group', **kwargs)#

Plot a Pearson correlation matrix. This function calculates the Pearson correlation coefficient between the methylation fractions across samples, using the compute_pearson_matrix function, and plots the matrix as a heatmap.

Parameters:
  • numerator – the numerator variable

  • denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively

  • min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation

  • groupby – label by which to aggregate the ContigDataset

  • **kwargs – additional keyword arguments that can be passed to customize the plot. These include: - Arguments for plt.subplots(): nrows, ncols, sharex, sharey, figsize - Axis customization: xlabel, ylabel, xscale, yscale, xlim, ylim, title - Figure customization: suptitle - Seaborn heatmap customization: vmin, vmax, cmap, alpha, annot, square, cbar_kws - Any other keyword arguments accepted by sns.heatmap()

Returns:

The figure and axes objects.

plot_six_base_ternary(denominator: str = 'num_total', min_coverage: int = 10, cmap: ~matplotlib.colors.LinearSegmentedColormap = <matplotlib.colors.LinearSegmentedColormap object>, gridsize: int = 50, figsize=None, include_colorbar: bool = True, **kwargs)#

Creates a ternary plot of 6-base data, showing the distribution of C, 5mC, and 5hmC fractions across samples.

Parameters:
  • denominator – the denominator variable, one of [“num_total”, “num_total_c”], the sum of all bases or all Cs respectively

  • min_coverage – the minimum coverage of the denominator required for a position to be included in the fraction calculation

  • cmap – the colormap to use

  • gridsize – the number of hexagons in the x-direction

  • figsize – the size of the figure

  • include_colorbar – whether to include a colorbar, showing the logged counts of contexts in a given bin.

  • **kwargs – the keyword arguments passed to plt.figure()

Returns:

The figure and axes objects.

reduce_bybins(column: Array | ndarray, bins: Array | ndarray, var: str | List[str], counts_only: bool = False, sum_only: bool = False, chunks: int = 100000) Dataset#

Reduce positions by binning column by provided bins. This is useful for looking at the distribution of data variables across a continuous variable such as distance to nearest TSS or CpG richness.

Parameters:
  • column – an array of length equal to the ContigDataset position dimension with the values by which to reduce the samples

  • bins – an array of bin values to use for grouping the values in column. Note: bins must be explicitly provided

  • var – data variable(s) to reduce

  • counts_only – return only the count of data entries

  • sum_only – return only the sum of data entries

  • chunks – column and bins chunks size

Returns:

xarray Dataset with the ContigDataset data reduced by column. Includes bin_start, bin_end and bin_midpoint.

reduce_byranges(ranges: PyRanges | List[PyRanges], var: str | List[str], counts_only: bool = False, sum_only: bool = False, min_count: int = 1, ranges_are_1based: bool = False, merge_range_intervals: bool = False, store_result_to_disk=True)#

Reduce data variables by the provided genomic ranges in order to gain a summarized view of the data over ranges. Returns a new xarray.Dataset with the data summarized over the provided ranges. This function computes results immediately, and is NOT a deferred operation. Important: Ranges are assumed to be 0 based. If ranges are 1 based, set ranges_are_1based=True.

Parameters:
  • ranges (pr.PyRanges or List[pr.PyRanges], required) – PyRanges object(s) to reduce by. If a list of PyRanges objects is provided, they will be concatenated before reducing.

  • var (str or List[str], required) – Data variable(s) to reduce. If a list of data variables is provided, they will be reduced separately.

  • counts_only (bool, optional) – If True, only the count of data entries will be returned. Default is False.

  • sum_only (bool, optional) – If True, only the sum of data entries will be returned. Default is False.

  • min_count (int, optional) – Minimum number of data entries required in each interval of range to calculate mean. Default is 1.

  • ranges_are_1based (bool, optional) – If True, ranges are assumed to be 1-based. Default is False (i.e. 0 based).

  • merge_range_intervals (bool, optional) – If True, intervals within the ranges will be merged if they overlap. Default is False.

  • store_result_to_disk (bool, optional) – If True, the result will be stored to disk and available via a dask deferred operation. Default is True. Recommended to leave as True, unless the number of regions is small (<500_000).

Returns:

Dataset with the data summarized over the provided ranges. The dataset will contain the following variables: - For each data variable in var:

  • <data_variable>_sum: the sum of data entries for the given range.

  • <data_variable>_mean: the mean of data entries for the given range.

  • <data_variable>_cpg_count: the non-NaN count of contexts for the given range.

  • Additional variables from the ranges dataframe.

Return type:

xr.Dataset

Note

<data_variable>_cpg_count returns the count of non-NaN data entries for the given range. If you pass a list of two variables to var, for instance num_modc and frac_modc, the corresponding <data_variable>_cpg_count could be different for each variable. This is because frac_modc is generated by ds.assign_fractions(…, min_converage=X) and contains NaN values where the coverage is below X. In this case, the <data_variable>_cpg_count for num_modc will be the same as the <data_variable>_cpg_count for frac_modc where the coverage is above X, otherwise, it will be different

rename_samples(**kwargs)#

Renames sample IDs with various available methods

Parameters:

None

Keyword Arguments:
  • names (# for mapping old to new)

  • mapping (dict) – A dictionary mapping old IDs to new IDs.

  • ID (# for applying a function to each sample)

  • func (callable) – A function to transform each sample ID.

  • IDs (# for removing a common prefix from all sample)

  • old_substr (str) – A substring to be replaced.

  • new_substr (str) – The replacement substring. Must be specified, i.e. does not default to “”

  • count (int, optional) – Number of occurrences to replace. Defaults to all.

  • IDs

  • remove_common_prefix (bool, optional) – If True, removes the common prefix.

  • prefix (str, optional)

  • reset (bool, optional) – If True, resets IDs to “SAMPLE_1”, “SAMPLE_2”, etc.

  • prefix – Prefix to use when resetting IDs. Defaults to “SAMPLE”.

Returns:

The modified ContigDataset with updated sample ids.

Return type:

self (contig_dataset.ContigDataset)

Note: - Order of precedence is mapping, func, (old_substr, new_substr), remove_common_prefix, reset. - Previous versions of modality allowed a list of sample IDs to be passed to this method.

This has been removed to avoid potential ordering bugs.

  • The ContigDataset is modified in place, but self is returned to allow chaining.

select_samples(sample_list)#

Select a subset of samples from the dataset.

Parameters:

sample_list (str or list) – the sample or list of samples to be selected.

Returns:

the ContigDataset instance with the selected samples.

Return type:

self (contig_dataset.ContigDataset)

property slices: dict#

Returns: generated slices dictionary

subset_bycoverage(min_coverage: int = 0, max_coverage=None, coverage_array='num_total_c', method='all', write_to_disk: bool = False, output_path=None)#

Subset the Contig dataset by coverage, returning a new dataset with only the positions that fall within the provided coverage range. Remove positions where the coverage is strictly lower than min_coverage and strictly higher than max_coverage if provided. The removal strategy is determined by the method parameter.

Parameters:
  • min_coverage (int) – the minimum coverage required for a position to be included in the subset.

  • max_coverage (int) – the maximum coverage required for a position to be included in the subset.

  • coverage_array (str) – the data variable to use for coverage.

  • method (str) – the method to use for determining coverage. One of [“all”, “any”, “mean”]. - “all” removes the position only if all the samples have a coverage strictly lower than min_coverage and strictly higher than max_coverage if provided. - “any” removes the position if any of the samples have a coverage strictly lower than min_coverage and strictly higher than max_coverage if provided. - “mean” removes the position if the mean coverage across all samples is strictly lower than min_coverage and strictly higher than max_coverage if provided.

  • write_to_disk (bool) – whether to write the subset to disk.

  • output_path (str) – the path to write the subset to.

Returns:

the subset ContigDataset.

Return type:

ds (ContigDataset)

Note

If write_to_disk is True, the function will write the subset to disk and return a new ContigDataset read from the written file. If write_to_disk is False, the function will return the subset ContigDataset without writing to disk.

subset_byranges(ranges: PyRanges | List[PyRanges], ranges_are_1based: bool = False, merge_range_intervals: bool = False, write_to_disk: bool = False, output_path=None)#

Subset the dataset by ranges, returning a new dataset with only the positions that fall within the provided ranges. Important: Ranges are assumed to be 0 based. If ranges are 1 based, set ranges_are_1based=True.

Parameters:
  • ranges (pr.PyRanges or List[pr.PyRanges]) – PyRanges object(s) to subset by. If a list of PyRanges objects is provided, they will be concatenated before reducing.

  • ranges_are_1based (bool, optional) – If True, ranges are assumed to be 1-based. Default is False (i.e. 0 based).

  • merge_range_intervals (bool, optional) – If True, intervals within the ranges will be merged if they overlap. Default is False.

  • write_to_disk (bool, optional) – Whether to write the subset to disk.

  • output_path (str, optional) – The path to write the subset to.

Returns:

the subset ContigDataset.

Return type:

subset_data (ContigDataset)

Note

If write_to_disk is True, the function will write the subset to disk and return a new ContigDataset read from the written file. If write_to_disk is False, the function will return the subset ContigDataset without writing to disk.

summary(contigs=None, contig_mapping='^(.+)_.+', autosomes=None)#

Returns a summary of the dataset in the form of a multi index dataframe.

And also a CpG coverage stats file with the number of CpGs at each coverage depth.

summary_cpg_coverage()#

Return a dataframe with the number of CpGs at each coverage depth.

to_bedgraph(sample_ids: list | None = None, modification_count_column: str = 'num_modc', denominator_column: str = 'num_total_c', additional_columns: list | None = None, tag: str = '', file_template: str = '{{sample_id}}.{tag}{{var}}.bdg.gz', output_dir: str = '.', skip_rows_zero_counts: bool = True, tabix_index: bool = False)#

Write data to a bedGraph format: https://genome.ucsc.edu/goldenPath/help/bedgraph.html

A bedgraph contains 4 required columns: chromosome, start, end, value. The value is the quantification value. We can append additional columns to the bedgraph, to provide further information.

The bedGraph format contains a header line that begins with the word “track” followed by a space-separated list of key-value pairs. The key-value pairs are of the form key=value. The track line must end with a newline character.

>>> from modality.mock_contig_dataset import make_mock_5hmc_cds
>>> cds = make_mock_5hmc_cds({"chr1": 1_000_000, "chr2": 1_000_000})
>>> cds = cds.assign_fractions(['num_hmc', 'num_mc'], denominator='num_total_c', min_coverage=0)
>>> bedgraphs = cds.to_bedgraph(output_dir="/tmp")
Parameters:
  • sample_ids – sample IDs to export bedGraphs for. Note: A separate file will be created for each sample. If None, all samples are exported.

  • modification_count_column – the data variable to export. If you are working with evoC data, this will typically be num_hmc for 5hmC or num_mc for 5mC.

  • denominator_column – the column to use to calculate the methylation fraction, and to use as the coverage. Default: “num_total_c”, if you wish to include all bases use “num_total”.

  • additional_columns – Extra data variable(s) to export. Both strings and list arguments are supported. Note 1: providing multiple column names will result in the bedGraph containing >4 columns and some downstream tools may not support this.

  • file_template – the template for the file name. The sample_id and value_column will be inserted into the template. Default: “{sample_id}.{var}.bdg.gz”

  • output_dir – path to output directory

  • skip_rows_nan_values – whether to skip rows with missing/nan values in the value_column. Default: True

  • tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False

Returns:

A list of created file paths.

Notes

An uncompressed output is not supported due to write speed and file size considerations. If you

require an uncompressed output, please use gunzip post-processing.

to_bedmethyl(sample_ids=None, modification_count_column: str = 'num_modc', denominator_column: str = 'num_total_c', tag: str = '', file_template: str = '{{sample_id}}.{tag}{{var}}.bed.gz', output_dir: str = '.', skip_rows_zero_counts: bool = True, tabix_index: bool = False)#

Write data to a bedmethyl format. A 11 column tab-separated file format that contains the following columns:

  • Reference chromosome or scaffold

  • Start position in chromosome

  • End position in chromosome

  • Name of item

  • Score from 0-1000. Capped number of reads

  • Strandedness, plus (+), minus (-), or unknown (.)

  • Start of where display should be thick (start codon)

  • End of where display should be thick (stop codon)

  • Color value (RGB)

  • Coverage, or number of reads

  • Percentage of reads that show methylation at this position in the genome

https://www.encodeproject.org/data-standards/wgbs/

Parameters:
  • sample_ids – sample IDs to export cxreports for. Note: A separate file will be created for each sample. If None, all samples are exported.

  • modification_count_column – the data variable to export. If you are working with evoC data, this will typically be num_hmc for 5hmC or num_mc for 5mC.

  • denominator_column – the column to use to calculate the methylation fraction, and to use as the coverage. Default: “num_total_c”, if you wish to include all bases use “num_total”.

  • file_template – the template for the file name. The sample_id and count_methylated_column will be inserted into the template. Default: “{sample_id}.{var}_cxreport.txt.gz”

  • output_dir – path to output directory

  • skip_rows_zero_counts – whether to skip rows with zero counts in the denominator column. Default: True

  • tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False

Returns:

A list of created file paths.

Notes

An uncompressed output is not supported due to write speed and file size considerations. If you require an

uncompressed output, please use gunzip post-processing.

to_bismark_cov_report(sample_ids=None, modification_count_column: str = 'num_modc', denominator_column: str = 'num_total_c', tag: str = '', file_template: str = '{{sample_id}}.{tag}{{var}}_bismark.txt.gz', output_dir: str = '.', skip_rows_zero_counts: bool = True, tabix_index: bool = False)#

Write data to a bismark coverage report format, see: https://support.illumina.com/help/BaseSpace_App_MethylSeq_help/Content/Vault/Informatics/Sequencing_Analysis/Apps/swSEQ_mAPP_MethylSeq_Bismark_Coverage.htm This is a 6 column tab-separated file format that contains the following columns: <chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>

Parameters:
  • sample_ids – sample IDs to export files for. Note: A separate file will be created for each sample. If None, all samples are exported.

  • modification_count_column – the data variable to export. If you are working with evoC data, this will typically be num_hmc for 5hmC or num_mc for 5mC.

  • denominator_column – the column to use to calculate the methylation fraction, and to use as the coverage. Default: “num_total_c”, if you wish to include all bases use “num_total”.

  • file_template – the template for the file name. The sample_id and count_methylated_column will be inserted into the template. Default: “{sample_id}.{var}_bed.txt.gz”

  • output_dir – path to output directory

  • skip_rows_zero_counts – whether to skip rows with zero counts in both methylation columns. Default: True

  • tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False

Returns:

A list of created file paths.

Notes

Methylation percentage is calculated as 100 * count_methylated / denominator_column. The <count unmethylated> column as described in the bismark format is calculated as denominator -

count_methylated. In a 3 state system (m, h, u) this column will therefore contain the sum of unmodified bases and bases with ANOTHER modification to that specified in the modification_count_column.

An uncompressed output is not supported due to write speed and file size considerations. If you require an

uncompressed output, please use gunzip post-processing.

to_quant(sample_ids=None, tag: str = '', file_template: str = '{{sample_id}}.{tag}{{var}}.quant.tsv.gz', mode='infer', output_dir: str = '.', skip_rows_zero_counts: bool = True, modc_degenerate_in_evoC: bool = True, tabix_index: bool = False)#

Write data to a biomodal quant report format. A 6 or 8 column tab-separated file format that contains the following columns:

# contig ref_position num_modC num_mC num_hmC num_C num_total strand for evoC

# contig ref_position num_modC num_C num_total strand for +modC

Parameters:
  • sample_ids – sample IDs to export cxreports for. Note: A separate file will be created for each sample. If None, all samples are exported.

  • file_template – the template for the file name. The sample_id and count_methylated_column will be inserted into the template. Default: “{sample_id}.{var}.quant.txt.gz”

  • mode – the mode to export. Accepts “evoC” or “modC” or “infer”. Default: “infer” where the mode is inferred based on the presence of the num_hmc column.

  • output_dir – path to output directory

  • skip_rows_zero_counts – whether to skip rows with zero counts in the num_total column. Default: True

  • modc_degenerate_in_evoC – in evoC i.e. 6 base whether the modc column is degenerate, ie modified but uncertain (m or h). If False, modC is the superset of m, h and (m or h).

  • tabix_index – If True, compresses the files using bgzip (bgzip/tabix must be installed on the system). If bgzip compression fails or is not available, the files are compressed using gzip. If False, files are compressed using gzip. Default: False

Returns:

A list of created file paths.

Notes

An uncompressed output is not supported due to write speed and file size considerations. If you require an

uncompressed output, please use gunzip post-processing.

to_zarrz(fpath: str, zipped: bool = True, overwrite: bool = False)#

Write the ContigDataset to a zarr store directory or zipped zarr store file.

Parameters:
  • fpath – the path to the zarr store

  • zipped – whether to use a zipped zarr store. Default is True.

  • overwrite – whether to overwrite an existing zarr store. Default is False.

Returns:

None

vstack(rest, dim='pos')#

Stack ContigDatasets by contig.

Parameters:
  • rest (ContigDataset or list) – the ContigDatasets to be stacked.

  • dim (str) – the dimension to be stacked.

Returns:

the stacked ContigDataset.

Return type:

new (contig_dataset.ContigDataset)

See:

https://docs.xarray.dev/en/stable/generated/xarray.concat.html

where(cond, other=-1, drop=False)#

Apply a condition to the dataset.

Parameters:
  • cond (bool) – the condition to be applied.

  • other – the value to be used where the condition is False.

  • drop (bool) – whether to drop the dimension where cond is False.

Returns:

the ContigDataset instance with the condition applied.

Return type:

self (contig_dataset.ContigDataset)

See:

https://xarray.pydata.org/en/v0.8.2/generated/xarray.Dataset.where.html

modality.contig_dataset.cast_result(result)#
modality.contig_dataset.cast_xrattr(function)#

Catch the result of an xarray.dataset attribute call and cast it according to its type, e.g. xr.dataset -> ContigDataset.

Parameters:

function (function) – the function to be wrapped.

Returns:

the wrapped function.

Return type:

wrapped_function (function)

modality.contig_dataset.combine_dataset_attrs(attrs_list: List[dict], context: str) dict#

Merge the attribute properties from a sequence of xarray objects or ContigDataset wrappers. This function creates an attribute dictionary with combined values, such as a list of sample_ids from the [sample_id] attribute present in the individual stores.

Note: must expect a sequence of attrs dicts and a context object as its only parameters.

See:

https://docs.xarray.dev/en/stable/generated/xarray.concat.html https://docs.xarray.dev/en/stable/generated/xarray.Context.html

Parameters:
  • attrs_list (list) – the list of attributes dictionaries.

  • context (xarray.Context) – the context object.

Returns:

the merged attributes dictionary.

Return type:

dict (dict)

modality.contig_dataset.concat_contig_datasets(datasets: List[ContigDataset], write_to_disk: bool = True, output_path=None)#

Concatenate multiple ContigDatasets along the sample_id axis.

Parameters:
  • datasets (List[ContigDataset]) – List of ContigDatasets to concatenate.

  • write_to_disk (bool) – Write the concatenated dataset to disk.

  • output_path (str) – Path to write the concatenated dataset to.

Returns:

Concatenated dataset.

Return type:

ContigDataset

modality.contig_dataset.parse_tuple_string(tuple_string)#

Parse a string representation of a tuple into a tuple of integers, e.g. (1_000, 2_000, 100)

This is needed as the attrs are stored as strings in the zarr store. Types such as tuples are not supported.

Parameters:

tuple_string (str) – the string representation of the tuple.

Returns:

the parsed tuple.

Return type:

tuple (tuple)

modality.contig_slices module#

modality.contig_slices.set_contig_slices(xds: Dataset) Dataset#

Given a dataset with a contig data array, compute the slices for each contig and store them in the dataset attributes.

Parameters:

xds – xarray dataset to be processed

Returns:

input dataset with the contig slices set in the attributes

modality.contig_slices.slices_are_valid(xds: Dataset) bool#

Validate that the slices in the dataset match the contig data positions.

Parameters:

xds – xarray dataset to be validated

Returns:

True if the slices are valid, False otherwise

modality.datasets module#

modality.datasets.load_biomodal_dataset(name: str = 'esE14', release: str = 'v1.0', chr2_only: bool = False, merge_technical_replicates=False, merge_cpgs=False) ContigDataset#

Load an example ContigDataset object. Currently, two datasets are available: ‘esE14’, which loads a 6-base mouse embryonic stem cell (ES-E14) dataset, and ‘giab’, which loads a 5-base human Genome in a Bottle dataset.

Parameters:
  • name – The name of the dataset to load. Available names are: ‘esE14’, ‘giab’.

  • release – release version

  • chr2_only – Whether to download only the chromosome 2 of the dataset, currently only supported for giab.

  • merge_technical_replicates – Whether to combine techical replicates (x2) for the study, reads are summed

  • merge_cpgs – Whether to merge CpG contexts of strand-specific methylation data.

Returns:

ContigDataset with the loaded biomodal dataset

Use the load_dataset_description function to get a description of the available datasets.

modality.datasets.load_dataset_description(name=None) str#
modality.datasets.load_ese14_methylation_data(release: str, chr2_only: bool = False, merge_technical_replicates=True, merge_cpgs=False)#
modality.datasets.load_giab_methylation_data(release: str, chr2_only: bool = False, merge_technical_replicates=True, merge_cpgs=False)#

modality.dmr module#

modality.dmr.call_dmrs(ds: ContigDataset, count_array_name: str, condition_array_name: str, total_counts_name: str = 'num_total_c', covariates: str | list | None = None, dmr_qvalue_cutoff: float | None = None, ranges: PyRanges | int | None = None, presegment_threshold=500, presegment_maxsize=5000, presegment_min_contexts=5, condition_order: list | None = None, as_pyranges: bool = False) PyRanges | DataFrame#

Call DMRs from a ContigDataset.

DMRs are defined as regions with a high variance in methylation levels, and a significant difference in methylation levels between conditions.

Parameters:
  • ds – ContigDataset containing count data.

  • count_array_name – Name of the count array in ds (usually num_modc, num_mc or num_hmc).

  • condition_array_name – Name of the condition array in ds, ie the variable to test for differential methylation.

  • total_counts_name – Name of the total counts array in ds.

  • covariates – Covariates to include in the model. If None, no covariates are included. If a string, a single covariate is included. If a list, multiple covariates are included.

  • dmr_pvalue_cutoff – p-value cutoff for DMR calling. If None, all putative DMRs are returned.

  • ranges – PyRanges containing ranges to evaluate for DMRs or an int representing the tile size. Note: if set to None presegmentation is performed, the contigs are split into regions defined by gaps larger than presegment_threshold.

  • presegment_threshold – Threshold for presegmenting the contigs.

  • presegment_maxsize – Maximum size of the intervals after splitting.

  • presegment_min_contexts – Minimum number of contexts in a presegmented region.

  • condition_order – Order of conditions to use for comparison. If None, the order is determined by the order in which they appear in ds.

  • as_pyranges – Whether to return the DMRs as a PyRanges object. If False, a pandas DataFrame is returned.

Returns:

A PyRanges object containing the DMRs, or a DataFrame if as_pyranges is False. The dataframe is sorted by ascending p-values. The structure of the DMR dataframe is as follows:

  • The first three columns are the contig, start, and end of the genomic range or tile (0-based).

  • num_contexts is the number of contexts in the region.

  • mean_coverage is the mean coverage of the contexts in the tile.

  • mean_mod_group_1``and ``mean_mod_group_2 are the mean methylation levels of the two conditions.

  • mod_fold_change is the fold change of the mean methylation levels between the two conditions, np.mean(condition_1)/np.mean(condition_2).

  • mod_difference is the mean methylation difference between the two conditions, defined as np.mean(condition_1) - np.mean(condition_2).

  • test_statistic is the test statistic of the likelihood ratio test.

  • dmr_pvalue is the p-value of significance of the difference in methylation levels between the two conditions.

  • dmr_qvalue is the FDR-corrected p-value.

Return type:

PyRanges or DataFrame

Notes

Our approach follows that of Methylkit [1]. The contexts of each region are summed together to get the total number of methylated and unmethylated counts. A DMR is defined as a region with a significant difference in methylation levels between two conditions. This implementation uses a logistic regression to model the methylation levels of each region. Assume that we have N samples, and for each sample we have a methylation level p_i (i=1..N) and a total number of counts n_i.

The full model is defined as: .. math:: logit(p_i) = eta_0 + eta_G imes group_i + eta_1 imes covariate_1 + eta_2 imes covariate_2 + …

The null model is defined as: .. math:: logit(p_i) = eta_0 + eta_1 imes covariate_1 + eta_2 imes covariate_2 + …

For each model we calculate the log-likelihood, and we use the likelihood ratio (LLR) test to compare the two models. The difference in log-likelihood between the full and null model is the test statistic. It is computed as: .. math:: LLR = -2 imes (log-likelihood_{null} - log-likelihood_{full})

We derive a p-value which tests the null hypothesis that the null model is correct (i.e., beta_G=0), using a Wald test. We can then correct this p-value for multiple testing using the Benjamini-Hochberg method, which is the dmr_qvalue column in the output.

Covariates can be included in the model to account for confounding factors. Types of covariates include: - Continuous variables: these are included as-is. - Categorical variables: these are converted to dummy variables using one-hot encoding.

References

modality.dmr.dmr_wrapper(block_umod, block_mod, covariates_arr, dmr_call_func)#

This function is responsible for calling the dmr_call_func function on the input arrays and computing the mean methylation levels for each group and the overall mean coverage.

Returns a array of shape (block_umod.shape[0], 5) where the columns are: - p-value of the likelihood ratio test - test statistic - mean methylation group 1 - mean methylation group 2 - mean coverage

modality.dmr.logistic_regression_llr(unmod, mod, covariates_arr)#

This is a logistic regression likelihood ratio test for DMR calling. The function itself is designed to be a parameter for the dmr_wrapper function.

See this comment: cegx-ds/modality#249

We should define a strategy pattern for this function when other methods are implemented.

Parameters:
  • unmod – Array of unmodified counts. N x M where N is number of regions/contexts and M is the number of samples.

  • mod – Array of modified counts. N x M where N is number of regions/contexts and M is the number of samples.

  • covariates_arr – Array of covariates. M x K where M is the number of samples and K is the number of covariates. K is at least 2, for the intercept and the group variable.

Returns:

  • p-value of the likelihood ratio test

  • test statistic

Return type:

Array of shape (N, 2) where the columns are

modality.genomes module#

modality.genomes.get_contig_dtype(ref_name)#

Get the dtype for a contig label for a given reference genome.

Parameters:

ref_name (str) – The reference genome name.

Returns:

The dtype for the contig label.

Return type:

type

modality.genomes.get_genomes_config(params_file)#

Get the genomes config.

Parameters:

params_file – params file path

Returns:

The parsed config file.

modality.genomes.get_sorted_contigs(ref_name, contigs, params_file, return_indices=False)#

Sort contigs by the order expected of reference {ref_name}.

Parameters:
  • ref_name (str) – The reference genome name.

  • contigs (list) – The contigs to be sorted.

  • return_indices (bool) – Whether to return the indices of the sorted contigs.

Returns:

The sorted contigs and optionally their indices.

modality.genomes.parse_config(text)#

Parse a nextflow config file.

TODO move to e.g. utils.

Parameters:

text (str) – The text of the config file.

Returns:

The parsed config file.

modality.index module#

modality.index.is_bgzip(fpath)#

Check if a file is bgzipped.

Parameters:

fpath – The file path.

Returns:

True if the file is bgzipped, False otherwise.

modality.index.is_newer(fpath, fpath2)#

Check if a file is newer than another file.

Parameters:
  • fpath – The first file path.

  • fpath2 – The second file path.

Returns:

True if the first file is newer than the second file, False otherwise.

modality.index.prep_index_factory(ttype, resort=True)#

Decorator factory to prepare an index for a file type.

Parameters:
  • ttype – The file type.

  • resort – Whether to resort the file.

Returns:

The decorator.

Raises:

AssertionError – If the file type is invalid.

modality.pca module#

modality.pca.mean_imputation(array)#
modality.pca.plot_pca_scatter(pca_object, pca_result, xaxis_pc: int = 1, yaxis_pc: int = 2, **kwargs)#

Generates a PCA scatter plot - uses seaborn.relplot

Parameters:
  • pca_object (sklearn.decomposition.PCA|dask_ml.decomposition.PCA) – returned pca object from run_pca

  • pca_result (np.ndarray|da.Array) – Result of PCA transformation from run_pca

  • xaxis_pc (int) – Desired PC from pca_result to plot on the x-axis

  • yaxis_pc (int) – Desired PC from pca_result to plot on the y-axis

  • **kwargs – Additional arguments to be passed to seaborn’s relplot

Returns:

The figure object

modality.pca.plot_pca_scree(pca_object, **kwargs)#

Generates a PCA scree plot - uses seaborn.catplot

Parameters:
  • pca_object (sklearn.decomposition.PCA|dask_ml.decomposition.PCA) – returned pca object from run_pca

  • **kwargs – Additional arguments to be passed to seaborn’s catplot

Returns:

The figure object

modality.pca.remove_majority_nan(array)#
modality.pca.run_pca(input_array: DataArray, n_components: int = 2, svd_solver: str = 'randomized', nan_handling_strategy: str = 'drop', **kwargs)#

Run PCA on a 2D array of data.

Parameters:
  • input_array (xr.DataArray) – The input array to run PCA on. The array should be 2D with sample_ids and contexts/ranges in the rows and columns respectively. If the array is not in this format, it will be transposed.

  • n_components (int) – The number of principal components to return.

  • svd_solver (str) – The solver to use for the PCA. Options are ‘auto’, ‘full’, ‘arpack’, ‘randomized’. See sklearn.decomposition.PCA for more details.

  • nan_handling_strategy (str) – The strategy to use for handling NaN values in the input array. Options are ‘drop’, ‘set_to_zero’ or ‘mean_value_imputation’. When using ‘mean_value_imputation’, columns that contain majority NaN values will be removed before imputation.

  • **kwargs – Additional keyword arguments to pass to the PCA class.

Returns:

The PCA object fit to the input_array. transformed_data (np.ndarray|dask.Array): The transformed input_array after dimensionality reduction.

Return type:

pca_object (sklearn.decomposition.PCA|dask_ml.decomposition.PCA)

modality.plot_features module#

modality.plot_features.catplot(featureset: Dataset, **kwargs)#

Wrapper around seaborn’s catplot function.

{plot_wrapper.__doc__.format(plot_type_name=”catplot”)}

This function specifically handles creating categorical plots.

modality.plot_features.displot(featureset: Dataset, **kwargs)#

Wrapper around seaborn’s displot function.

{plot_wrapper.__doc__.format(plot_type_name=”displot”)}

This function specifically handles creating distribution plots.

modality.plot_features.lmplot(featureset: Dataset, **kwargs)#

Wrapper around seaborn’s regplot function.

{plot_wrapper.__doc__.format(plot_type_name=”regplot”)}

This function specifically handles creating regression plots.

modality.plot_features.plot_wrapper(featureset: Dataset, plot_func, required_vars, **kwargs)#

Wrapper around seaborn’s {plot_type_name} function.

Parameters:
  • featureset – An xarray dataset containing the data to be plotted.

  • kwargs – A dictionary of keyword arguments passed to the {plot_type_name} function.

Returns:

The {plot_type_name} object; i.e. a FacetGrid.

Note if the data argument is provided in kwargs, an error will be raised.

modality.plot_features.prepare_data(featureset: Dataset, required_vars, **kwargs)#

Prepares data for plotting.

modality.plot_features.relplot(featureset: Dataset, **kwargs)#

Wrapper around seaborn’s relplot function.

{plot_wrapper.__doc__.format(plot_type_name=”relplot”)}

This function specifically handles creating relational plots.

modality.ranges module#

modality.ranges.default_position_mask(positions_array: Array | ndarray) ndarray#

Produce a default boolean mask for a dask array of positions.

Parameters:

positions_array (da.Array or np.ndarray) – The dask or numpy array of positions.

Returns:

A boolean mask for the positions array - all False.

Return type:

np.ndarray

Raises:
  • ValueError – If the positions_array is not a dask array of integers.

  • ValueError – If the positions_array is not monotonically increasing.

modality.ranges.get_fixed_windows(contig_lengths: dict, window_size: int = 50000)#

Generate fixed-size genomic windows based on contig lengths.

Notes

1. if you wish to start/end the windows in a particular place, use pr.PyRanges intersect. windows are non overlapping- sliding windows are not supported. 2. The contig_lengths parameter should be a dictionary where keys are chromosome names and values are their corresponding lengths.

Parameters:
  • contig_lengths (dict) – A dictionary mapping chromosome names to their respective lengths.

  • window_size (int) – size of the genomic window to be used. Default is 50,000 base pairs.

Returns:

A PyRanges object representing non-overlapping genomic windows.

Return type:

pyranges.PyRanges

modality.ranges.masked_array_fromranges(positions_array: Array | ndarray, contig: str, ranges: PyRanges) ndarray#

Produce a boolean mask for a dask array of positions based on whether the positions overlap intervals from a pyRanges object.

Parameters:
  • positions_array (da.Array or np.ndarray) – The dask|numpy array of positions.

  • contig (str) – The contig associated with the positions array.

  • ranges (pr.PyRanges) – The pyRanges object containing genomic intervals used for masking.

Returns:

A boolean mask for the positions array.

Return type:

np.ndarray

Raises:
  • ValueError – If the positions are not a dask array of integers.

  • ValueError – If the positions are not monotonically increasing.

  • ValueError – If the ranges are not a PyRanges object.

modality.ranges.merge_ranges(ranges, columns=None, strand=None, sep=';')#

Merge ranges in a PyRanges object.

Parameters:
  • ranges (PyRanges) – The PyRanges object to be merged.

  • columns (list) – The columns to be merged.

  • strand (str) – Only merge intervals on same strand if True

  • sep (str) – The separator to be used.

Returns:

The merged PyRanges object.

Return type:

pr.Pyranges.pyranges

modality.ranges.ranges_loader(ranges: PyRanges | List[PyRanges], ranges_are_1based: bool = False, merge_intervals: bool = False)#

Generate consistent PyRanges object from a list of PyRanges objects with some additional handling.

Parameters:
  • ranges (Union[pr.PyRanges, List[pr.PyRanges]]) – The PyRanges object(s) to be loaded.

  • ranges_are_1based (bool) – Whether the ranges are 1-based. Default is False.

  • merge_intervals (bool) – Whether to merge the intervals. Default is False.

Returns:

The PyRanges object.

Return type:

pr.PyRanges

modality.reduction module#

class modality.reduction.AbstractArray(shape, dtype, on_disk=False)#

Bases: object

Class to hold an array in memory or on disk.

retrieve()#
modality.reduction.chunk_ranges_core(starts, ends, chunk_sizes)#

This function is a numba-compiled version of the following: Given a set of ranges as starts, ends break them into sub-ranges where each sub-range is contained within a chunk. This avoids complexity when using operations such as da.map_blocks that operates over chunks.

Parameters:
  • starts (list or np.array) – List or array of range start positions

  • ends (list or np.array) – List or array of range end positions

  • chunk_sizes (list or np.array) – List or array of chunk sizes

Returns:

the original id for each sub-range chunk_ids (np.array): the index of the chunk that contains each sub-range new_row_starts (np.array): the start position of each sub-range new_row_ends (np.array): the end position of each sub-range

Return type:

chunk_index (np.array)

modality.reduction.count_non_nan(a)#

Count the number of non-nan values in an array.

Parameters:

a (np.array) – The array to count.

Returns:

The number of non-nan values in the array.

Return type:

np.array

modality.reduction.get_nansum(a)#

Get the sum of an array, ignoring nan values.

Parameters:

a (np.array) – The array to sum.

Returns:

The sum of the array.

Return type:

np.array

modality.reduction.get_tile_index(pos, length=1000)#
modality.reduction.reduce_block_bytile(coor_block, data_block, reducer=<ufunc 'add'>, length=100)#

Function to reduce a block of data by tile.

Parameters:
  • coor_block (np.array) – The block of coordinates.

  • data_block (np.array) – The block of data.

  • reducer (function) – The reduction function.

  • length (int) – The length of the tiles.

Returns:

The reduced block.

Return type:

np.array

modality.reduction.reduce_bycolumn_core(a, output)#
modality.reduction.reduce_byslices_core_float(starts, ends, a, output)#
modality.reduction.reduce_byslices_core_int(starts, ends, a, output)#

modality.reference module#

modality.reference.get_context_positions_pyranges(fasta_path, contig, context='CG', start=0, end=None, step_size=1000000, chunk_size=100000, forward_only=False)#
modality.reference.get_reference_name(fpath)#

modality.utils module#

modality.utils.chunked_generator(iterator: Iterator, chunk_size: int)#

Iterate over an iterator in fixed-size chunks.

Parameters:
  • iterator (iterable) – The input iterator to be chunked.

  • chunk_size (int) – The size of each chunk.

Yields:

list

A chunk of elements from the input iterator, with each chunk having at most chunk_size

elements.

modality.utils.compose_dataset_ranges(contig_slices, positions)#

Internal function to compose a pyranges object from a dataset.

Essentially it loops through all the contigs in the dataset and creates a pyranges object spanning from the first to the last positions

modality.utils.create_trinucleotide_dict()#
modality.utils.filter_warnings(package='modality')#

Filter warnings if package is installed in editable mode, i.e. when it is installed for development instead of used as a library.

Function to be run at the top of all package submodules that users may import directly or import specific objects from them.

modality.utils.get_confpath(conf_name: str) str#

Get the path to a configuration file.

Parameters:

conf_name – The name of the configuration file.

Returns:

The path to the configuration file.

modality.utils.is_jsonable(x)#

Check if an object is JSON serializable.

Parameters:

x (object) – The object to be checked.

Returns:

True if the object is JSON serializable, False otherwise.

modality.utils.make_name_unique(string_value: str)#

Append value to make e.g. column unique

modality.utils.package_in_editable_mode(package: str) bool#

Check if package is installed in editable mode. Returns False if the package is not installed.

Parameters:

package – package name

Returns:

boolean of whether the package is installed in editable mode.

modality.utils.rechunk_dataset(template_data: Dataset, axis0_chunk_size: int = 100000, axis1_chunk_size: int = 2)#

Rechunk an xarray dataset. :param template_data: The dataset to be rechunked. :type template_data: xr.Dataset :param axis0_chunk_size: The desired chunk size for the first axis. :type axis0_chunk_size: int :param axis1_chunk_size: The desired chunk size for the second axis. :type axis1_chunk_size: int

Returns:

The rechunked dataset.

Return type:

xr.Dataset

modality.utils.sampled_generator(iterator: Iterator, sampling: float, chunk_size: int | None = None, seed: int = 42) Generator#

A function that applies random sampling to each chunk of an iterator.

Notes

  1. The input iterator can be of any type, and elements are internally converted to NumPy arrays for consistency.

  2. If sampling is None, the entire iterator is yielded without any sampling.

  3. If chunk_size is provided, the iterator is divided into chunks before sampling, improving efficiency.

  4. The random sampling is performed using a uniform distribution between 0.0 and 1.0.

  5. The function uses a pseudo-random number generator with the provided seed for reproducibility.

  6. The output is generated incrementally in chunks, reducing memory usage for large iterators.

Parameters:
  • iterator (iterable) – The input iterator to sample from.

  • sampling (float or None) – The sampling rate, indicating the probability of including each element in the output. If None, no sampling is performed, and the entire iterator is returned.

  • chunk_size (int or None, optional) – If provided, the iterator is divided into chunks of this size before sampling. If None, no chunking is performed, and the entire iterator is considered as a single chunk. Default is None.

  • seed (int, optional) – Seed for the random number generator. Default is 42.

Yields:

numpy.ndarray

Each yielded item is a NumPy array containing sampled elements from the input

iterator.

Raises:

AssertionError – If the product of the chunk size and the sampling rate is less than or equal to 10. This is to ensure that the sampling is efficient.

modality.utils.subtract_from_genes(group, genes)#

Subtract a group of ranges from a gene.

Parameters:
  • group (pd.DataFrame) – A group of ranges.

  • genes (pd.DataFrame) – A dataframe of genes.

Returns:

The resulting dataframe after subtracting the ranges from the gene.

Return type:

pd.DataFrame