Clustering sequence data with OPTICS

Introduction

This notebook is intended to give an overview of the clustering module in ashure.py. The following will examine both the naive and augmented implementation of OPTICS on clustering nanopore sequence data and show how applying OPTICS to sequence data can help us infer the true haplotype sequences present in our sequencing experiment.

Loading the dataset

The pseudo reference database generated by ashure.py will be used as the input dataset for the clustering module. To generate the pseudo reference database, run ashure.py on this subset of basecalled reads we have provided.

Next, run the clst module of ashure.py on the pseudo reference database generated. This will compute the cluster centers and write the output to pseudo_clst.csv.gz. We will visualize these cluster centers relative to the whole dataset in the next section.

Visualizing sequence data with t-SNE

t-SNE can help visualize the relationship between sequences by performing dimensionality reduction on the pairwise distances between sequences. In the code snippet below, pseudo reference sequences, cluster center sequences of the pseudo reference, and the mock50 reference sequences are loaded and concatenated. Pairwise alignments are computed via run_minimap2() and saved to a csv file called pseudo_pw.csv.gz

The code snippet below loads the alignment information generated by minimap2, transforms it into a distance matrix, and runs t-SNE on the pairwise distance data. t-SNE output is written to a pseudo_tsne.csv.gz for plotting.

The following code snippet loads the t-SNE data, matches the sequences to their haplotype via minimap2, and adds color maps to each cluster label.

The following code snippet plots the t-SNE data with bokeh and adds interactive tooltips for each datapoint.

Each circle above represents a sequence from the pseudo reference database. Red squares are the cluster centers computed by our augmented implementation of OPTICS. Blue diamonds are the haplotype sequences.

Each sequence from the pseudo reference database is currently colorized by the cluster_id, which is the label associated with each cluster center computed by our implementation of OPTICS.

On this t-SNE plot, data points lying close to each other have low pairwise distance, which mean their sequences match closely to each other. Data points far apart are more unrelated. The overlap between haplotype sequence and cluster centers on the t-SNE plot indicate that cluster centers computed by the clst module match closely to the true haplotype sequence.

Despite these reads being highly errored, density based clustering approaches, such as OPTICS, are able to classify them into haplotype groups. Another benefit of OPTICS is computation distances from a theoretical cluster center, which is termed reachability. Lower reachability means the sequence is closer to the centroid of a cluster of sequence. Reachability is highly correlated with error in the sequence, and we are able to use it to infer the true haplotype sequence from corrupted reads.

Why this works is more apparent if we examine the ordering vs reachability plot generated by OPTICS.

Correlation between reachability and error profile of reads

The OPTICS algorithm attempts to identify centroid regions in a set of data based on density. After centroid regions are identified, data points are ordered relative to their distance from these initial centroids. The reachability metric is the distance from the center of a cluster.

The resulting reachability vs ordering plot reveal a series of dips, which correspond to spatially dense clusters of data. In the sequence data context, these low reachability regions correspond to sequences which match closely to each other and have low error. Moving away from the cluster center, the sequences become more disimilar and have higher error.

Our augmented implementation of OPTICS (the clst module) tries to infer the identity of the haplotype sequence by performing multi-alignment on sequences in these low reachability regions and computing an error corrected consensus. This procedure is analogous to computing the centroid of a set of data in k-means clustering.

tSNE without haplotype and cluster center bias

The above plots were computed with haplotype sequences and cluster center sequences included in the data. This could potentially bias clustering results. To show that OPTICS can identify low error reads unbiasly, the same analysis is repeated while excluding cluster center and haplotype sequences from the computation.

tSNE plot of sequences colorized by haplotype identity

The left tSNE plot has been colorized by the haplotype sequence. The middle plot is colorized by read error. The right plot is colorized by the reachability computed by OPTICS.

Haplotype from the same species lie close together in this spatial map. Cluster centroids tend to have low error and low reachability relative to peripheries of the clusters. This recapitulates what is already seen in the OPTICS ordering plot.

Reachability plot colorized by haplotype identity

Here the reachability plot has been colorized by the haplotype sequence. Sequences with low reachability have low error. Sequences from the same haplotype fall in the same cluster as defined by OPTICS.

OPTICS on large datasets and room for improvement

The naive implementation of OPTICS in sci-kit learn presented above was unsuitable for clustering large (>10000 sequences) datasets. Computing pairwise distances for 10000 sequences is CPU and memory intensive. To make OPTICS tractable for computation on a laptop, random sampling was used to estimate the global distribution of pairwise distances, and OPTICS was modified to work on subsets of the whole data. This approach avoided computation of large pairwise distance matrices, saving significant memory and CPU time. However, this algorithm must now be run iteratively a kin to how k-means clustering works. If enough iterations are run, all cluster centers could be found. However, convergence is not always guaranteed. As shown in the tSNE plots, some cluster centers are still suboptimal, and we are still seeking ways to filter out bad results.