Generating a pseudo reference database

Loading data into python

After running the basecaller on fast5 files, you will have a folder containing basecalled fastq reads. contains a few useful functions for reading and writing fastq and fasta files. As some of these files could be quite large, we use the low_mem=True option to save on RAM when running this notebook, but still allow us to fetch the sequence and quality strings.

id is a unique identifier string for each fastq read.

ch is the channel where the read originated from.

start_time is the acquisition time of read.

length is the length of the sequence.

Apply add_seq() to df_reads to get the sequence and quality information.

Searching for reference sequences uses reference sequence to search for concatemers in PCR products generated by rolling circle amplification. These repeats are multi-aligned with spoa to generate a more error free consensus read. In the absence of a reference database, we would need a best guess at what we are sequencing. This is our pseudo reference database, which can be generated by performing a paired end search for primer sequences in the basecalled reads.

Searching every read for forward and reverse primer pairs is time intensive and unnecessary. PCR amplifies the original templates, making the reference over represented. We can filter out large repetitive fragments and just examine a subset that are likely to contain our reference sequences. For our study, we focus only on reads 500-1200bp long because these reads are likely to contain only one concatemer of CO1.

For users who are using short primers <20bp, the paired end primer search may fail because of read corruption. This is a limitation of minimap2, which was not design to align highly errored short sequences. To address this shortcoming, we recommend the following potential fixes to our users:

1) Augment primer sequences by combining the forward and reverse primer sequences. Use this as the new fwd and reverse sequence for the paired end search. pfwd = ATGC prev = GGGG new-pfwd = CCCC ATGC new-prev = GCAT GGGG

2) If the identity of the gene being sequenced is known, augment the primer sequences by add 40-60bp from both of ends of the gene being sequenced. The drawback of this approach is that the paired primer search becomes biased by the addition of the user sequence. The pseudo reference database may not contain all the gene families you are interested in examining ,and you may need to reexamine the reads which did not yield any hits in the first run of in order to make sure your sequencing results are complete.

The cytochrome oxidase subunit 1 (CO1) gene is about 700bp. Size filtering is used to filter out concatemers. We want to focus on non-repeated reads because there is a chance some reads are so corrupted that the primer sequence could not be found in the correct location. You could end up with concatemers instead of a single full length CO1.

To illustrate this, lets run with length filtering from 500-1200bp and 500-3000.

With larger bp size windows, concatemers will show up as peaks at around 1500bp and 2400bp. These concatemers are sometimes undesirable because they interfer with proper dereplication of an RCA read. In special cases, concatemers could be desirable because the gene of interest has many repetitive domains such as ankyrin proteins. We leave these decisions up to the user as you are the expert on the gene you are sequencing.

Filtering and concatenation of reads by size is fairly straight forward. An example is provided below.

x = df_prf2[(df_prf2['length'] > 700) & (df_prf2['length'] < 1000)]    # get sequences between 700-1000bp
y = df_prf2[(df_prf2['length'] > 1200) & (df_prf2['length'] < 1600)]   # get sequences between 1200-1600bp
df = pd.concat([x,y]) # concatenate the dateframes
df.to_csv('example.csv.gz', compression='infer', index=False)    # write the data to csv

Clustering and compressing the pseudo reference

Compacting and dereplicating the pseudo reference can significantly speed up run time without hurting accuracy and sensitivity. This is possible because the pseudo reference contains redundant haplotype sequences. Remember that exact matches are not required. The pseudo reference is meant to be a fuzzy representation of the overall dataset such that genes of interest can quickly be found.

The code snippet below shows how to align the pseudo reference against the true reference database for mock50 samples and reveal the redundancy of information in the pseudo reference.

run_minimap2() is a wrapper function to minimap2 in To get more help, type help(bpy.run_minimap2) in python shell.

get_best() is a helpful function to filter to the best alignment

Alignment of fasta or csv files against other fasta or csv file can also be invoked using This script depends on Below is an example usage:

./ -h     # get help
./ -i pseudo_ref.csv.gz -o acc.csv.gz   # input pseudo_ref.csv.gz and output to acc.csv.gz

The violin/swarm plot above shows that some haplotypes are represented by 1 sequence, while some are represented by several hundred sequences. Compute time is waste on aligning every fastq read to every closely related pseudo reference sequence.

The ideal pseudo reference database for this dataset is around 50 sequences or one sequence for each mock50 sample. The pseudo reference can be compacted by invoking the clst module in The clst module uses OPTICS density clustering to order, label, merge, and return the representative center sequences for each haplotype. OPTICS is an unsupervised clustering algorithm that uses density to flag potential sequence clusters and cluster centers.

Clustering removed many of the redundant sequences. Although some rare haplotypes may have been lost, we retain the majority of the interesting sequences which could still be in our sample.

For some sequences, the accuracy of the pseudo reference entry improves because merging via multi-alignment averages out the sequencing errors. For other sequences, the accuracy seems to have decreased because wrong sequences were merged together.

More on this will be elaborated in the next section and in clustering.ipynb

To run the remainder of with this compressed pseudo reference database, an example is shown below:

./ -h                  # get help
./ run -fq fastq/* -p primers.csv -db pseudo_clst.csv.gz -o1 cons.csv.gz --low_mem  # run the whole pipeline

Detection of rare haplotypes

Sometimes rare haplotypes are missed in the first run of This could be because rare haplotypes were not included in the pseudo reference database and were sufficiently distinct from found haplotypes that minimap2 did not find good alignments. Rare haplotypes could be recovered by re-examing the reads which did not yield any hits, blasting some of these reads, and/or adding some of these reads into the pseudo reference database.

The pseudo reference database does not need to be perfect. The sequences only need to approximately match your gene of interest. If the rare gene is wholly missed in the first run, can be rerun on reads which did not strongly map to anything.

Visualizing the pseudo reference clusters

The similarity relationships between sequences can be visualized via a dendrogram, pairwise matrix, or tsne plot. In this section, we show how functions in can help you generate pairwise distance data and make tsne plots to visualize your sequence data.

Getting pairwise distance information

In the code snippet below, pseudo reference sequences, cluster center sequences of the pseudo reference, and the mock50 reference sequences are loaded and concatenated. Pairwise alignments are computed via run_minimap2() and saved to a csv file called pseudo_pw.csv.gz.

Plotting the TSNE representation of the distance relationships

Dendrograms are not always suitable for viewing sequence relationships. Complicated relationships such as evenly spaced points on a sphere are not captured by dendrograms. Neighborhood embeddings such as TSNE are better at preserving high dimensional relationships. TSNE projects high dimensional relations onto low dimensional space and tries to minimize the divergence in neighborhood structure between the low dimension and high dimension representations of the data.

The code snippet below shows how to apply TSNE to the pseudo reference and obtain a 2D representation of the relationships between sequences. HDBSCAN is used to flag clusters in 2D space. Data is returned as a pandas dataframe and saved to pseudo_tnse.csv.gz.

The code snippet below aligns the pseudo reference against the pseudo cluster centers and true reference. This information lets us colorize the scatter plots and interpret how well OPTICS performed.

The above code snippet colorizes each sequence by the de novo cluster obtained from Blue diamonds denote the position of reference haplotype sequences. Red squares denote the position of de novo cluster centers from

The following code snippet colorizes sequences by percent match to the haplotype sequence.

The sequence of each de novo cluster center can be obtained using the code snippet below. Feel free to blast these to check their alignment against the reference.