Getting Started#

Note that modality follows the standard Python convention:

  • Coordinates are 0-based.

  • Intervals are half-open, i.e. the start coordinate is included, but the end coordinate is not.
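This is the same convention as Python's built-in slicing, so a quick sketch:

```python
# 0-based, half-open: the interval 2..4 covers indices 2 and 3, not 4.
bases = ["A", "C", "G", "T", "A", "C"]

region = bases[2:4]  # start included, end excluded
print(region)  # ['G', 'T']

# A handy consequence: the length of a half-open interval is end - start.
assert len(region) == 4 - 2
```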

Core Concepts#

modality is a library for working with genomic data. Its core object is the ContigDataset, which holds the arrays representing the methylation counts together with a set of methods for working with them.

The ContigDataset contains:

  • num_modc - the number of modified cytosines

  • num_total_c - the total number of cytosines

  • num_c - the number of unmodified cytosines

  • num_other - the number of other bases

  • num_total - the total number of all bases, including Ns.

When working with duet evoC, biomodal’s 6-base product, the ContigDataset also contains:

  • num_mc - the number of methylated cytosines

  • num_hmc - the number of hydroxymethylated cytosines

These arrays are two-dimensional: the first dimension is the genomic position and the second is the sample. Additional covariates can be added to the ContigDataset as additional arrays using the xarray syntax.

The pos and contig arrays describe the genomic position and contig of the data. These are one-dimensional arrays with the same length as the first dimension of the methylation arrays. This concept of contiguous arrays is central to the ContigDataset object and its methods.
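To make this layout concrete, here is a hand-built xarray.Dataset with the same structure. The values, positions, and sample names are made up, and a real ContigDataset carries more arrays and metadata:

```python
import numpy as np
import xarray as xr

# Toy data: 4 genomic positions x 2 samples (values are made up).
num_modc = np.array([[3, 5], [0, 1], [7, 2], [4, 4]])
num_total_c = np.array([[10, 9], [8, 12], [11, 10], [9, 13]])

ds = xr.Dataset(
    data_vars={
        "num_modc": (("pos", "sample_id"), num_modc),
        "num_total_c": (("pos", "sample_id"), num_total_c),
    },
    coords={
        # pos and contig are 1-D and share the first dimension of the counts.
        "pos": [100, 150, 200, 250],
        "contig": ("pos", ["chr1", "chr1", "chr1", "chr1"]),
        "sample_id": ["sample_A", "sample_B"],
    },
)
print(ds.sizes)
```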

In both duet +modC and duet evoC, the meaning of num_modc is consistent: it is the sum of methylated and hydroxymethylated cytosines.

If you are interested in the methylation fraction, you can calculate it using the assign_fractions method.
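assign_fractions is the method to use in practice. Purely as a sketch of the arithmetic involved, with made-up counts and assuming that positions below min_coverage are masked to NaN, the fraction is the modified count divided by the total count:

```python
import numpy as np

# Toy 6-base counts for 5 positions in one sample (values are made up).
num_mc = np.array([4, 0, 7, 2, 1])
num_hmc = np.array([1, 0, 2, 0, 1])
num_modc = num_mc + num_hmc  # modified = methylated + hydroxymethylated
num_total_c = np.array([20, 3, 18, 25, 4])

min_coverage = 10  # positions below this coverage become NaN in this sketch
fraction = np.where(
    num_total_c >= min_coverage, num_modc / num_total_c, np.nan
)
print(fraction)  # e.g. 5/20 = 0.25 at the first position
```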

The ContigDataset is subset-friendly, and we encourage you to use the subsetting functionality where possible, as methods always act on the whole dataset. For example, to plot the correlation of the methylation fraction over a region of chromosome 1:

from modality.contig_dataset import ContigDataset
mx = ContigDataset.from_zarrz("mystore.zarrz")
mx.assign_fractions("num_modc", "num_total_c", min_coverage=10, inplace=True)
mx["chr1:2_546_000-3_789_000"].plot_pearson_matrix(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=15,
    vmin=0.8,
)

Some functions, such as dmr.call_dmrs, require that the data contain only a single contig, so you must subset the data before calling them.

Loading and accessing data#

The biomodal duet pipeline will output a zipped zarr store that is opened by ContigDataset.

The ContigDataset can be loaded from a zipped zarr store using the from_zarrz method, which takes a path to the zarr store. modality handles both local and remote zarr stores; note that if a remote path is given, the zarr store is cached locally.

As part of the release of modality, we have included two example datasets. These are:

  • A duet +modC dataset of human Genome in a Bottle (GIAB) samples. Each of the seven GIAB samples was sequenced on a single lane of a NovaSeq S4 flowcell, and two technical replicates are available for each sample. This data is aligned to the hg38 genome.

  • A duet evoC dataset of mouse ES-E14 cells, containing four technical replicates. These data are aligned to the mm38 genome.

These datasets can be loaded using the modality.datasets module. For example:

from modality.datasets import load_biomodal_dataset

mx = load_biomodal_dataset("esE14")  # or "giab"
mx

Please note, the load_biomodal_dataset function will download the data from the internet if it is not already present in the cache. The esE14 dataset is 750 MB and the giab dataset is 3.1 GB.

Adding new data to existing ContigDataset objects#

Adding metadata is a common operation, and so we have included a convenience method to do this in the ContigDataset object. Additional metadata can be read into the ContigDataset using the merge_meta method. This method takes a pandas DataFrame or a filepath, and merges the metadata into the ContigDataset.

If a filepath is given, the CSV file should have a “sample_id” column that matches the names of the samples in the ContigDataset.

new_ds = ds.merge_meta(metadata, coord_columns=coord_columns)

Note that this method only adds data along the sample dimension.
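As an illustration of the expected shape, a metadata table might look like the following (all column names other than sample_id are hypothetical):

```python
import pandas as pd

# sample_id must match the sample names in the ContigDataset;
# the other columns here are hypothetical examples of covariates.
metadata = pd.DataFrame(
    {
        "sample_id": ["sample_A", "sample_B"],
        "condition": ["control", "treated"],
        "batch": [1, 2],
    }
)
print(metadata)
```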

A more general solution is to use the xarray assign method (documented in the xarray documentation) to add new data to the ContigDataset, or to any other xarray object (such as the output of ds.reduce_byranges). It is important to be explicit about the dimensions of the data being added. For example, to add a new 1D array of random data along the pos dimension, you would do the following:

import numpy as np

data_array = np.random.random(ds.sizes["pos"])

ds.assign(example_data=("pos", data_array))

If pos is omitted, a new dimension is created (which is usually not what you want):

ds.assign(example_data=data_array)

If the data is 2D, you must specify both dimensions:

data_array = np.random.random((ds.sizes["pos"], ds.sizes["sample_id"]))

ds.assign(example_data=(("pos", "sample_id"), data_array))

This can also be written using item assignment, which may be clearer in most cases (note that item assignment modifies the dataset in place, whereas assign returns a new dataset):

ds["example_data"] = ("pos", data_array)
# or
ds["example_data"] = (("pos", "sample_id"), data_array)

Other methods are available, but we encourage you to read the xarray documentation, as the ContigDataset is a subclass of xarray.Dataset.

Wider reading#

The ContigDataset is a subclass of xarray.Dataset, and so it has all the methods and properties of an xarray.Dataset. We encourage you to read the xarray documentation to understand how to work with the ContigDataset object.

FAQ#

How do I combine samples from multiple ContigDatasets?

If you have multiple ContigDataset objects, corresponding to separate sets of samples stored separately on disk, we provide a convenience function for combining them. Specifically, one can use concat_contig_datasets() to concatenate multiple ContigDataset objects along the sample_id axis.

This function takes a list of loaded ContigDatasets and returns a single ContigDataset with all the samples combined. By default, the function writes the concatenated dataset to disk and then reloads it as the returned object. This unlocks the downstream efficiency of modality for your new dataset.

For example, if you have two stores of data, you can combine them using the following code:

from modality.contig_dataset import ContigDataset, concat_contig_datasets

mx1 = ContigDataset.from_zarrz("store1.zarrz")
mx2 = ContigDataset.from_zarrz("store2.zarrz")

datasets_to_combine = [mx1, mx2]
concatenated_dataset = concat_contig_datasets(
    datasets=datasets_to_combine,
    output_path="store_combined.zarrz",
)

Please note: this function is specifically for combining separate samples across otherwise consistent datasets. The function checks for compatible contigs, reference positions, and other data variables, and raises an error if they are not compatible.
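Conceptually, the operation resembles xarray.concat along the sample dimension. A toy sketch with hand-built datasets (not real modality stores, and values are random):

```python
import numpy as np
import xarray as xr

def toy(sample_names, seed):
    # Build a small dataset with random counts for the given samples.
    rng = np.random.default_rng(seed)
    counts = rng.integers(0, 10, size=(3, len(sample_names)))
    return xr.Dataset(
        {"num_modc": (("pos", "sample_id"), counts)},
        coords={"pos": [100, 200, 300], "sample_id": sample_names},
    )

ds1 = toy(["s1", "s2"], seed=0)
ds2 = toy(["s3"], seed=1)

# Positions must agree between the two datasets; samples are stacked.
combined = xr.concat([ds1, ds2], dim="sample_id")
print(combined.sizes)  # 3 positions, 3 samples
```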

How do I combine the counts across both the positive and negative strands for each CpG?

If you wish to combine counts across strands for each CpG in your dataset, you can use the .merge_cpgs() method. This method combines the counts for each CpG across the positive and negative strands. If you wish to merge CpGs, we recommend performing this operation at the start of your analysis, before additional data variables are added to your dataset. This is because the operation reduces the number of positions by 50% (it sums the counts across strands recorded as adjacent CpG positions), and it is not possible to infer the correct handling of every possible data array in this process. As a result, any non-default array with a pos dimension will be dropped when performing this operation.
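The strand-collapsing arithmetic can be sketched in plain numpy, assuming (as a simplification) that positions alternate strictly between the two strands of each CpG:

```python
import numpy as np

# Toy counts at 6 strand positions = 3 CpGs x 2 strands (values made up).
# Rows alternate: CpG0 (+), CpG0 (-), CpG1 (+), CpG1 (-), ...
num_modc = np.array([3, 2, 0, 1, 5, 4])

# Collapse adjacent strand pairs by summing: 6 positions -> 3 CpGs.
merged = num_modc.reshape(-1, 2).sum(axis=1)
print(merged)  # [5 1 9]
```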

The code block below demonstrates the usage of this method:

from modality.datasets import load_biomodal_dataset

mx = load_biomodal_dataset("giab")  # or "esE14"
mx

merged_cpg_mx = mx.merge_cpgs()
merged_cpg_mx