Optimization of cerebrospinal fluid microbial DNA metagenomic sequencing diagnostics
We implemented a metagenomic DNA sequencing methodology to unbiasedly detect microbial species in CSF samples from patients with CNS symptoms in which a pathogen or EBV had been detected (Additional 3: Table 1). Samples positively identified with pathogen-specific quantitative PCR (qPCR), 16S rRNA gene sequencing or bacterial/mycotic culture in CSF were included. Different pathogen types and variation of viral loads were chosen. The majority of samples CSF contained low levels of EBV, but variable levels of leukocyte content. These samples were chosen to establish the sensitivity of the method. DNA from each sample was extracted and fragmented before library preparation and sequenced using massive parallel sequencing. An overview of the experimental procedure, bioinformatic classifiers, and post-classification analyses are depicted in Fig. 1.
DNA metagenomic sequencing workflow. DNA from cerebrospinal fluid specimens, containing leukocytes and pathogens, was extracted and followed by library preparation and sequencing. Datasets generated by the Ion S5 were processed by four different bioinformatics classifiers to profile the microbiome. BLAST was used for verification. Flowchart for identification of pathogens by removing false positive species. Virus contaminants can be removed by comparison of sample datasets with controls by which environmental and bioinformatic misclassifications are identified. Phages can be disregarded as these viruses do not infect human cells. A final manual examination of remaining viral reads is required for coverage and sequence analysis. The bacterial contaminants were removed by applying a filter of cutoff value and comparison between classifiers and controls followed by a manual examination.
Bioinformatic classifiers
Four bioinformatic classifiers were included, Kraken2, Centrifuge, our in-house developed PaRCA (Pathogen detection for Research and Clinical Applications) and CosmosID. CosmosID was tested mainly for its ability to generate concise pathogen lists, but the format of the platform prevented a detailed analysis of the raw data and was therefore not included in all comparisons in the manuscript. The four bioinformatic classifiers diverged with regards to fraction of processed reads (from 85 to 100%, Additional 3: Tables 2–3). However, the ability to identify the primary pathogen was similar when comparing the classifiers.
Sensitivity
Initially, three CSF samples (Sample 1–3) with high virus load of herpesvirus were analyzed. HSV1 and VZV were detected by all bioinformatic classifiers (Table 1). In sample 1, HSV1 was positively identified at 1 × 104 genome equivalents per milliliter (Geq/ml) using qPCR. The sequencing library consisting of more than 15 million reads contained 6.2–7.2 HSV1 reads per million sequences analyzed (parts per million; ppm). The following two samples originated from patients with similar values of VZV DNA levels quantified by qPCR (1.9 and 3.9 × 105 Geq/ml). Despite equivalent levels a ten-fold difference in detected VZV reads was observed between sample two (15–16 ppm) and sample three (135–147 ppm). Sample 2 contained 272 × 106 white blood cells (WBC) per liter compared with sample 3 which contained 17 × 106 WBC per liter (Table 1). Thus, the difference in sensitivity was related to variations in the leukocyte composition in the samples.
To further test the sensitivity, two CSF samples containing JC polyomavirus (JCV), a DNA virus with a relatively small genome, were processed. One sample contained high virus levels (1.9 × 105 Geq/ml) and the other low virus levels (4.3 × 103 Geq/ml) (Sample 4–5). JCV DNA was readily detected in both samples ranging from 1757 to 2096 ppm in sample 4 and 40–57 ppm in sample 5.
In order to verify that the methodology was applicable for bacterial agents, we sequenced CSF from two patients with pneumococcal meningitis, diagnosed by cultivation and/or 16S rRNA gene Sanger sequencing (Sample 6–7). DNA from Streptococcus pneumoniae (S. pneumoniae) was classified with a range between 30,704 to 60,661 ppm (Sample 6), and 679–804 ppm (Sample 7). In addition to the bacterial samples, we included two CSF samples from patients with RNA viral enterovirus CNS infection (Sample 8–9). As expected, no DNA reads were identified. Enterovirus was, however, found using metagenomic RNA sequencing (Additional 1: Fig. 1).
Samples with co-infections, where EBV was detected along with a primary infectious agent (Enterovirus sample 9, VZV sample 10–11 and Cryptococcus sp. sample 12), were analyzed. EBV was not detected in sample 9 in accordance with the predicted sensitivity. The enterovirus RNA was as expected not detected in our DNA sequencing libraries. VZV and EBV were detected in sample 10, and only VZV was detected in sample 11. Neither yeast nor EBV DNA was detected in sample 12. The results were expected when the following equation was applied for calculating the sensitivity for each agent.
The theoretically expected number of pathogen reads was calculated using the Eq. (1) according to pathogen genome size (GP), the diploid human genome size of 6.5 billion base pairs (GH), pathogen copy according to PCR per milliliter (CP), whole blood cell count per milliliter (CH), and adjusted according to the volume (V), sequencing library size (L) and mappability in percent (M) to remove major contaminants.
$$Pathogen, read = L/left( {frac{{C_{H} times G_{H} times V}}{{C_{P} times G_{P} }} times M^{ – 1} } right)$$
Thus, in 0.3 ml CSF with a normal WBC count (5 × 103 per milliliter), containing one pathogen with a 1 million base pairs genome, a sequencing of library 10 million reads would produce one single pathogen read, provided that the mappability is > 95%.
We included additional nine CSF samples with low levels of EBV DNA (50–2000 Geq/ml) (Sample 13–21). With the exception of sample 13 (patient diagnosed with CNS Hodgkin’s lymphoma type Post-Transplant Lymphoproliferative Disorder), and sample 16, where EBV was considered the cause of the symptoms, the EBV findings were clinically interpreted as benign incidental findings i.e. not the causative agent for the symptoms of infection. The EBV DNA detected in the majority of samples is likely to originate from latently infected B-lymphocytes recruited into the CSF. Despite the limitations for absolute quantification using qPCR and the stochasticity of distribution of low-level pathogen particles, with one exception the calculated reads correlated with the detected reads in the sequencing data (Table 1). In ten samples, more than 1 viral reads were expected and pathogen sequences were found in all samples (Fig. 2a). In eight samples where less than 1 read was expected to be found, EBV reads were only detected in one dataset (sample 17). Sixteen copies of EBV per milliliter was detected in sample 17 using qPCR and 11 reads were detected using metagenomic sequencing even though 0.3 reads were expected. The discrepancy between the calculation and sequencing results is most likely due to the stochastic distribution of the few viral particles in the sample. In sample 20, 0.99 reads were expected to be detected in the dataset and a single EBV-read was identified in two of the four classifiers (Kraken2 and Centrifuge). This read was further confirmed using BLAST. The WBC count in sample 18 was below the reference interval of the leukocyte cell counter and was therefore omitted.
Pathogen genome alignment. Coverage density plot of sequencing reads from respective sample and control detected in PaRCA aligned to reference genomes of HSV1 (a), VZV (b), JCV (c), S. pneumoniae (d) and EBV (e, f). Number of reads (y-axis) at each nucleotide position of the genome (x-axis) depicted in blue. Dark blue represents peak, bright blue average and light blue minimum coverage for respective sections of the genome.
All pathogen reads and control from PaRCA were mapped against the corresponding genome sequences using CLC genomics workbench (Fig. 2; Additional 1: Fig. 3). A dispersed distribution of the reads to the corresponding genomes was observed for all samples, except sample 10, where 5 of the 7 VZV reads (1 overlapping read) originate from a repetitive region within the genome and is therefore expected to be detected at a higher rate, and the last 2 reads map to a downstream gene (no overlap) (Additional 1: Fig. 3d). Each sequencing library was subjected to BLAST using the respective reference pathogen genome. The variation of the absolute number pathogen reads comparing the different classifiers detected was lower than 25% (Table 1).
Calculated pathogen reads and detected pathogens in bioinformatic classifiers. Samples containing HSV1 (dark blue), VZV (blue) and EBV (light blue) with a calculated value of more than one read (a) were plotted against the number of reads detected by PaRCA. Regression line depicting a direct proportionality between the calculated and observed variables. The Spearman’s rank correlation coefficient and p-value is indicated. Mean number ± SEM of viral (b) and bacterial species (c) classified in samples and controls using the different bioinformatic classifiers. Dark blue bars show the total number of species classified, bright blue bars show the amount of bacterial species over the fraction cutoff (≥ 0.01% of the dataset), light blue bars show number of species not removed using controls. Purple bars show controls and light gray show controls over the fraction cutoff (≥ 0.01%). Ordinary one-way ANOVA with Tukey’s multiple comparisons, *p value < 0.05, **p value < 0.01, ***p value < 0.001, ****p value < 0.0001.
Qualitative and quantitative detection of a known pathogen can thus reproducibly be carried out using the different types of bioinformatic classifiers. Furthermore, an estimation of sensitivity for pathogens can be generated for each sample which can guide the clinician whether the sequencing depth is sufficient to find a certain type of pathogen (Additional 4: Table 4). Notably however, each classifier produced diverse quantities of false positive hits.
False positive pathogens
The diversity of viral species detected in metagenomic sequencing libraries were relatively low and recurrent. PaRCA, Kraken2, Centrifuge and CosmosID identified 2–31, 5–13, 17–96 and 0–4 viral species in each sample respectively (Fig. 3b; Additional 1: Fig. 2a, Additional 4: Table 5). Many of the most abundant viral species identified were found in multiple samples (Fig. 3). Two samples (4 and 13) contained human viruses which were not detected in multiple samples and not a previously confirmed pathogen (see below).
The non-pathogen/EBV viral reads were either of human origin, misclassified or contaminations. Human endogenous retrovirus K was identified in all samples, except for the water control, which was expected as the reads originate from the human genome (Fig. 4 bottom; Additional 4: Table 5). Another ubiquitously detected virus was the BeAN 58,058 virus, which was detected in all samples, except for the water control. An additional BLAST examination identified these hits as human reads. Low levels of phage sequences known to infect bacteria from the Enterobacteriales order were detected in a few samples and in the water control, most likely derived from bacteria purified enzymes used in the various steps of library preparation. A conspicuous pseudomonas phage contaminant in sample 4, 5 and the water control are likely derived from a bacterial contaminant at one of the sequencing sites. Streptococcus phage species were detected in sample 6, from a patient with S. pneumoniae meningitis. Importantly, the most prominent viral species identified in patient samples were also present in the cell controls at similar levels and displayed a similar sequence identity and could therefore be discarded as a pathogen.
Compared with the relatively few viral agents detected by the classifiers, bacterial species were abundant; 61–712 bacterial species were identified using PaRCA, 370–1408 in Kraken2, 845–2826 in Centrifuge and 0–14 in CosmosID (Fig. 3c; Additional 1: Fig. 2b). Two samples originated from patients with a known S. pneumoniae meningitis (sample 6 and 7) and bacteria were detected at 69,088 ppm and 803 ppm respectively (PaRCA). With the omission of the positive samples 6 and 7, trace levels (3.4–18.2 ppm) of S. pneumoniae was ubiquitously detected in all samples. A known environmental contamination of Pseudomonas was detected in the majority of the samples. In two samples (4 and 5) Pseudomonas constituted 389,480 ppm (39%) and 590,195 ppm (59%) of the entire sequencing library respectively, while the prevalence in other samples were lower 6.6–75,279 ppm (0.0007–7.5%). A large fraction of the detected bacteria was still left when using previously suggested fixed cut-off at 100 ppm (0.01%) (Fig. 3c; Additional 1: Fig. 2b) and unlike the virus species the contaminants/misclassifications cannot be entirely removed using the control samples. However, when further applying an additional filter of comparison of the detected bacterial species between the three classifiers (PaRCA, Kraken2 and Centrifuge) only the known pathogen (S. pneumoniae) or environmental contaminants (Pseudomonas and Escherichia coli) was left. Similarly no eukaryotic species were found in all three classifiers.
Considering the ubiquitous presence of viral misclassifications and contaminants in samples as well as controls, a viral pathogen is easily identifiable, but requires additional analyses including read distribution and BLAST analysis, for verification in a clinical setting (discussed below). In contrast, the large number of bacterial species identified pose a bioinformatic challenge as the bacterial sequence can be derived from kit contaminants, lab environment or bioinformatic misclassifications which obscure the pathogen reads. As with the virus hits, removal of bacterial contaminants using cell controls can efficiently remove the majority of species, but additional filters are required (Fig. 1).
Controls
Two types of controls, water and cell control, were tested for their ability to mirror the bioinformatic misclassifications and contaminations observed in samples. In the water control the dataset consisted of 99.6% bacterial sequences and 0.06% viral sequences (Additional 4: Table 5). The cell controls originating from EBV-transformed cancer cells had a composition more similar to the samples with 99.2–99.4% human sequences. The number of viral and bacterial strains detected in the water control was 12 and 568 respectively. In contrast the cell controls contain sequences ranging from 3–4 viral and 61–177 bacterial strains.
The viral strains in the water control were mainly of phage origin. In contrast the viral strains detected in the cell controls were similar to the CSF samples, mainly Human endogenous retrovirus K and BeAN 58,058 virus. Both cell lines originate from EBV-transformed cancer cells and harbors EBV DNA. The ppm-values of each cell line between sequencing runs were reproducible and no significant difference was found between the classifiers (Additional 1: Fig. 4, Additional 3: Table 6).
In the water control, 98% of the sequencing library consisted of reads from Pseudomonas and the second most abundant bacterial strain found was Escherichia coli (0.1%), which is to be expected as most enzymes are produced in this bacterial system. In contrast, none of the bacterial strains in the cell controls constituted more than 0.1% of the sequencing library.
Thus, the water control efficiently amplified the environmental and kit contaminants, but in contrast to the cell control did not find human misclassifications. Also, since the water control consisted entirely of contaminants, the absolute or proportionate content did not allow for a direct comparison with the patient samples. The cell control allowed for direct quantitative and qualitative subtraction of the majority of contaminants and putative pathogens were identified.
Unexpected virus findings
In sample 2 and 3 we identified 29–34 EBV reads in both samples in all classifiers (Additional 4: Table 5). The reads were dispersed throughout the genome and displayed minor sequence Pathogen detection in DNA metagenomic sequencing in CSF is mainly limited by the leukocyte count which affects the sensitivity and bioinformatically with the reference genome in accordance with previous EBV findings (Additional 1: Fig. 5a, b). Due to the limited sample volume we were unable to verify and quantify this finding using qPCR.
In sample 4 we identified three viruses which were unexpected, mastadenovirus, papillomavirus and torque teno virus (Additional 1: Fig. 5c–e). PaRCA identified 32 reads matching human mastadenovirus C (HAC), Kraken2 32 reads, Centrifuge 30 reads and CosmosID did not report any HAC sequences. The majority of reads, 25 out of 32 were 198 bp long, 5 reads were shorter and 2 were longer. BLAST-analysis showed that all reads shared the same 3’-end. Four reads had mismatches in comparison with the reference sequence. Considering the size and distribution of the reads our findings are most likely a laboratory amplicon contamination. Human papillomavirus (HPV) reads were detected in PaRCA (12 reads), Kraken2 (2 reads), but not by Centrifuge and CosmosID. Ten of the 12 reads were 105 bp long and the remaining two, 104 bp and 106 bp respectively. All reads aligned to the 3’-end of the virus genome in the L1 gene. Examination of BLAST results showed a high similarity with HPV98 with a one or two base pair mismatch. As above, considering the size and distribution of the reads our findings were most likely a laboratory amplicon contamination. CosmosID has an inbuilt function to filter out hits that are considered to be amplicons, therefore the software did not report these reads. Different strains of Anellovirus/Torque teno virus (TTV) were detected in the classifiers. PaRCA identified 75 reads, Kraken2 25 reads, Centrifuge 55 reads, while CosmosID did not detect any TTV reads. Five distinct consensus reads/contigs were formed from the 75 reads identified in PaRCA. Thirty-one reads formed a consensus read of 196 bp. BLAST analysis of this read displayed a 97% identity with TTV14, but only for 91 bp of the fragment. The remaining parts of the contig did not show any alignment with any viral species. The origin of this read is therefore unknown. BLAST analysis of the remaining 4 reads/contigs showed alignment (> 95% query cover and identity) to an Anellovirus isolate previously identified in metagenomics. The alignment showed an unusual coverage of the 5’-end of the genome and all the reads were aligned to the first half of the genome. The reason for this unusual coverage is unknown, but considering that TTV is widely detected in metagenomic sequencing and the multiple reads aligning to a clinical isolate it is probable that these four contigs/reads originate from the patient sample.
In sample 13, we detected 10 reads corresponding to hepatitis C virus (HCV) in PaRCA. Kraken2, Centrifuge and CosmosID detected 5, 6 and 6 reads respectively. The 10 reads were concentrated to the 5’-end of the genome, but spread within the initial half of the genome (Additional 1: Fig. 5f.). An analysis of the BLAST results showed alignment with HCV genotype 1. Synonymous mutations were found in multiple reads as well as gaps. Two reads had a fusion between sequences from different regions of the HCV genome. The sequence diversity indicates that the virus is from a patient, but the frameshift and fusion reads indicate that they are of an artificial origin. Also, the patient had undergone HCV serology analysis which was negative. Finally, considering that HCV is a RNA virus this finding is most likely a laboratory amplicon contamination.