Getting Started#
Note that modality follows the standard Python convention:
- Coordinates are 0-based.
- Intervals are half-open, i.e. the start coordinate is included, but the end coordinate is not.
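This is the same convention as ordinary Python slicing, so intervals behave like list indices:

```python
# Positions 0..9 on a toy contig. Python slicing is 0-based and
# half-open, mirroring the coordinate convention used by modality.
positions = list(range(10))

# The interval 2-5 includes positions 2, 3 and 4, but not 5.
region = positions[2:5]
print(region)  # [2, 3, 4]
```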
Core Concepts#
modality is a library for working with genomic data. The core object is a ContigDataset. This contains the arrays that represent the methylation counts, and contains a set of methods for working with them.

The ContigDataset contains:
- num_modc - the number of modified cytosines
- num_total_c - the total number of cytosines
- num_c - the number of unmodified cytosines
- num_other - the number of other bases
- num_total - the total number of all bases, including Ns.
When working with duet evoC, biomodal's 6-base product, the ContigDataset also contains:
- num_mc - the number of methylated cytosines
- num_hmc - the number of hydroxymethylated cytosines
These arrays are two-dimensional, with the first dimension being the genomic position and the second being the sample. Additional covariates can be added to the ContigDataset as additional arrays using the xarray syntax.

The pos and contig arrays describe the genomic position and contig of the data. These are one-dimensional arrays with the same length as the first dimension of the methylation arrays. This concept of contiguous arrays is central to the ContigDataset object and its methods.
In both duet +modC and duet evoC, the meaning of num_modc is consistent: it is the sum of methylated and hydroxymethylated cytosines.
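As a sketch of how these counts relate, using toy NumPy arrays rather than the library itself (the values are hypothetical):

```python
import numpy as np

# Toy per-position counts for a single sample (hypothetical values).
num_mc = np.array([3, 0, 5])   # methylated cytosines
num_hmc = np.array([1, 2, 0])  # hydroxymethylated cytosines
num_c = np.array([6, 8, 5])    # unmodified cytosines

# In duet evoC, num_modc is the sum of the two modification types.
num_modc = num_mc + num_hmc

# The total cytosine count is modified plus unmodified cytosines.
num_total_c = num_modc + num_c

print(num_modc)     # [4 2 5]
print(num_total_c)  # [10 10 10]
```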
If you are interested in the methylation fraction, you can calculate it using the assign_fractions method.
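Conceptually, the fraction is the modified count divided by the total cytosine count, with low-coverage positions masked out. A rough NumPy sketch of the idea, not the library's implementation (the NaN masking behaviour for min_coverage is an assumption here):

```python
import numpy as np

# Toy counts at three positions (hypothetical values).
num_modc = np.array([4.0, 2.0, 9.0])
num_total_c = np.array([20.0, 5.0, 12.0])
min_coverage = 10

# Mask positions whose total cytosine coverage is below the threshold.
fraction = np.where(num_total_c >= min_coverage, num_modc / num_total_c, np.nan)
print(fraction)  # [0.2  nan  0.75]
```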
The ContigDataset is subset-friendly, and we encourage you to use the subsetting functionality where possible, as methods always act on the whole dataset. For example, if you want to plot the correlation of the methylation fraction on a region of chromosome 1, you would:

import modality
from modality.contig_dataset import ContigDataset

mx = ContigDataset.from_zarrz("mystore.zarrz")
mx.assign_fractions("num_modc", "num_total_c", min_coverage=10, inplace=True)
mx["chr1:2_546_000-3_789_000"].ds.plot_pearson_matrix(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=15,
    vmin=0.8,
)
Some functions, such as dmr.call_dmrs, require that the data contain only a single contig, so you must subset the data before calling these methods.
Loading and accessing data#
The biomodal duet pipeline will output a zipped zarr store that is opened by ContigDataset.

The ContigDataset can be loaded from a zipped zarr store using the from_zarrz method. This method takes a path to the zarr store. modality handles local and remote zarr stores; it is important to note that if a remote path is given, the zarr store is cached locally.
As part of the release of modality, we have included two example datasets. These are:
- A duet +modC dataset of human Genome in a Bottle (GIAB) samples. Each of the seven GIAB samples was sequenced on a single lane of a NovaSeq S4 flowcell. Two technical replicates are available for each sample. This data is aligned to the hg38 genome.
- A duet evoC dataset of mouse ES-E14 cells, containing four technical replicates. These data are aligned to the mm38 genome.
These datasets can be loaded using the modality.datasets module. For example:

from modality.datasets import load_biomodal_dataset

mx = load_biomodal_dataset("esE14")  # or "giab"
mx
Please note, the load_biomodal_dataset function will download the data from the internet if it is not already present in the cache. The esE14 dataset is 750 MB and the giab dataset is 3.1 GB.
Adding new data to existing ContigDataset objects#
Adding metadata is a common operation, so we have included a convenience method for this in the ContigDataset object. Additional metadata can be read into the ContigDataset using the merge_meta method. This method takes a pandas DataFrame or a filepath, and merges the metadata into the ContigDataset.

The CSV file should have a "sample_id" column that matches the names of the samples in the ContigDataset.

new_ds = ds.merge_meta(metadata, coord_columns=coord_columns)

Note this is only for adding data along the sample dimension.
A more general solution is to use the xarray syntax to add new data to the ContigDataset or any other xarray object (such as the output of ds.reduce_byranges), using the assign method, documented here. It is important to be explicit about the dimensions of the data being added.

For example, to add a new 1D array of random data along the pos dimension, you would do the following:
import numpy as np

data_array = np.random.random(ds.sizes["pos"])
ds = ds.assign(example_data=("pos", data_array))
If pos is omitted, a new dimension is created (which is usually not what you want):

ds.assign(example_data=data_array)
If the data is 2D, you must specify both dimensions:

data_array = np.random.random((ds.sizes["pos"], ds.sizes["sample_id"]))
ds = ds.assign(example_data=(("pos", "sample_id"), data_array))
Note, this syntax is equivalent to the following, which may be clearer in most cases:

ds["example_data"] = ("pos", data_array)
# or
ds["example_data"] = (("pos", "sample_id"), data_array)
Other methods are available, but we encourage you to read the xarray documentation, as the ContigDataset is a subclass of xarray.Dataset.
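As an illustration with a plain xarray.Dataset (the variable names and values below are toy examples with the same dimension layout):

```python
import numpy as np
import xarray as xr

# A toy dataset with the same dimension layout as a ContigDataset.
toy_ds = xr.Dataset(
    {"num_modc": (("pos", "sample_id"), np.array([[4, 2], [1, 3]]))},
    coords={"pos": [100, 101], "sample_id": ["s1", "s2"]},
)

# Standard xarray methods work directly, e.g. averaging over samples.
mean_per_pos = toy_ds["num_modc"].mean("sample_id")
print(mean_per_pos.values)  # [3. 2.]
```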
Wider reading#
The ContigDataset is a subclass of xarray.Dataset, and so it has all the methods and properties of an xarray.Dataset. We encourage you to read the xarray documentation to understand how to work with the ContigDataset object.
FAQ#
How do I combine samples from multiple ContigDatasets?
If you have multiple ContigDataset objects, corresponding to separate sets of samples stored separately on disk, we provide a convenience function for combining them. Specifically, you can use concat_contig_datasets() to concatenate multiple ContigDataset objects along the sample_id axis.

This function takes a list of loaded ContigDatasets and returns a single ContigDataset with all the samples combined. By default, this function writes the final concatenated dataset to disk and then reloads it under the variable you assign in the process. This unlocks the downstream efficiency of modality for your new dataset.
For example, if you have two stores of data, you can combine them using the following code:

from modality.contig_dataset import ContigDataset, concat_contig_datasets

mx1 = ContigDataset.from_zarrz("store1.zarrz")
mx2 = ContigDataset.from_zarrz("store2.zarrz")

datasets_to_combine = [mx1, mx2]
concatenated_dataset = concat_contig_datasets(
    datasets=datasets_to_combine,
    output_path="store_combined.zarrz",
)
Please note: this function is specifically for combining separate samples across otherwise consistent datasets. The function will check for compatible contigs, reference positions, and other data variables, and raise an error if they are not compatible.
How do I combine the counts across both the positive and negative strands for each CpG?
If you wish to combine counts across strands for each CpG in your dataset, you can use the .merge_cpgs() method. This method combines the counts for each CpG across the positive and negative strands. If you wish to merge CpGs, we recommend performing this operation at the start of your analysis, before additional data variables are added to your dataset. This is because the operation reduces the number of positions by 50% (it sums the counts across strands recorded as adjacent CpG positions), and it is not possible to infer the correct handling of every possible data array in this process. Thus, any non-default array with a pos dimension will be dropped when performing this operation.
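Conceptually, the strand merge sums counts at adjacent CpG positions. A simplified NumPy sketch, assuming for illustration that the two strands of each CpG are stored as adjacent rows (this layout is an assumption, not the library's internal representation):

```python
import numpy as np

# Toy counts at six positions: each CpG occupies two adjacent rows,
# one per strand (hypothetical values).
num_modc = np.array([3, 1, 0, 2, 5, 5])

# Summing adjacent pairs halves the number of positions.
merged = num_modc[0::2] + num_modc[1::2]
print(merged)  # [4 2 10]
```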
The code block below demonstrates the usage of this method:

from modality.datasets import load_biomodal_dataset

mx = load_biomodal_dataset("giab")  # or "esE14"
mx

merged_cpg_mx = mx.merge_cpgs()
merged_cpg_mx