Create UMI Reads from Reads cannot group read pairs when R1 or R2 reads have different start positions
Issue description
Reads that have different start positions relative to the target fragment are not grouped into the same UMI read by Create UMI Reads from Reads in affected software versions. Such reads may be generated by protocols where the DNA is fragmented after the UMI has been added, or when two primers are located near each other in single primer extension protocols, making it possible for one primer to amplify a PCR product generated from the other (figure 1).
Figure 1. During target enrichment, PCR products from gene-specific primer 1 and the universal primer can be amplified by gene-specific primer 2 and the universal primer. This can create reads that share the UMI but originate from different primers and therefore have different start positions.
Potential impact
This issue can lead to too few reads being assigned to each UMI read, so that, overall, more UMI reads are reported than there should be. This can in turn lead to higher coverage values being reported than is warranted. The effect is expected to be greater where many reads per UMI read are expected.
Consequently, results from fusion calling, clonotype identification, and variant calling may be affected.
Recommendation for working with affected data sets
If using Create UMI Reads from Reads, including working with workflows containing this tool (see “Software affected”, below), please update the Biomedical Genomics Analysis plugin to a version where this issue has been addressed. Then, in the Advanced settings of Create UMI Reads from Reads, change the “Hasher type” option from “Simple k-mer hasher” to “Rolling min k-mer hasher” in both the coarse and fine grouping sections. This will cause the computation to be done at a finer scale, resulting in successful grouping of reads with different start positions into the same UMI read.
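The toy sketch below illustrates, in general terms, why a key taken from the smallest k-mer across a whole read can be less sensitive to the read's start position than a key taken from the k-mer at the read's start. It is not the plugin's implementation, which is not described in this notice: the function names, the k-mer length, the UMI, and the sequences are all made up for illustration.

```python
from collections import defaultdict

K = 8  # hypothetical k-mer length, chosen only for this toy example


def simple_kmer_key(read: str, k: int = K) -> str:
    """Key a read by the k-mer at its start; sensitive to the start position."""
    return read[:k]


def rolling_min_kmer_key(read: str, k: int = K) -> str:
    """Key a read by its lexicographically smallest k-mer over all positions."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def group_reads(reads, key_fn):
    """Group (umi, sequence) pairs by UMI plus the sequence-derived key."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[(umi, key_fn(seq))].append(seq)
    return groups


# One made-up fragment; the second read starts 5 bases further in.
fragment = "GATTACAGGCTTAAACCCGGGTTTACGTACGTTGCA"
reads = [
    ("ACGTACGTACGT", fragment[0:30]),
    ("ACGTACGTACGT", fragment[5:35]),
]

simple_groups = group_reads(reads, simple_kmer_key)
rolling_groups = group_reads(reads, rolling_min_kmer_key)

# Number of groups (toy stand-ins for UMI reads) under each keying scheme.
print(len(simple_groups), len(rolling_groups))
```

In this toy case the start-based key places the two reads in separate groups even though they share a UMI and overlap, while the smallest-k-mer key places them in one group, because the smallest k-mer lies in the region covered by both reads.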
For affected data sets, making this change is expected to result in a decrease in the number of UMI reads created, and in turn, the coverage reported should better reflect the true coverage for the sample. These differences may also lead to slightly different results for fusion detection, clonotype analysis, or variant detection. In addition, the annotations on consistently identified fusions, clonotypes, and variants may change, for example, the estimated frequency or read count support.
If downstream tool settings and filtering steps have been optimized based on results generated using Create UMI Reads from Reads with the default “Hasher type” setting, “Simple k-mer hasher”, these may need to be reviewed. Examples include settings in tools such as Detect and Refine Fusion Genes, Low Frequency Variant Detection, and Fixed Ploidy Variant Detection, as well as options for filtering fusions, clonotypes, and variants.
Software affected
Create UMI Reads from Reads delivered by:
- All versions of the Biomedical Genomics Analysis plugin up to and including 23.1.
- Biomedical Genomics Analysis plugin versions 24.0 and 24.0.1.
The issue was addressed in Biomedical Genomics Analysis plugin versions 23.1.1 and 24.0.2 through the introduction of a new advanced “Hasher type” option, “Rolling min k-mer hasher” (see “Recommendation for working with affected data sets”, above).
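For sites that track installed plugin versions in scripts, the affected and fixed versions listed above could be encoded roughly as follows. This is a minimal sketch, not part of the software: the helper names are hypothetical, and it assumes that version strings are dot-separated integers, that “23.1” is equivalent to “23.1.0”, and that versions not named in this notice (for example, releases later than 24.0.2) are not affected.

```python
def parse_version(text: str) -> tuple:
    """Turn '23.1' into (23, 1, 0); assumes dot-separated integers."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad so '23.1' compares like '23.1.0'
    return tuple(parts)


# Affected 24.x versions named in this notice.
AFFECTED_24 = {(24, 0, 0), (24, 0, 1)}


def is_affected(version: str) -> bool:
    """Hypothetical check against the versions listed in this notice."""
    v = parse_version(version)
    # Affected: everything up to and including 23.1, plus 24.0 and 24.0.1.
    return v <= (23, 1, 0) or v in AFFECTED_24


for candidate in ["23.1", "23.1.1", "24.0.1", "24.0.2"]:
    print(candidate, is_affected(candidate))
```

Note that updating only makes the new option available; the “Hasher type” setting still needs to be changed as described in the recommendation above.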
Template workflows delivered by all versions of the Biomedical Genomics Analysis plugin that can produce results affected by this problem are:
- Detect QIAseq RNAscan Fusions for QIAseq Targeted RNAscan Panels
- Perform QIAseq Immune Repertoire Analysis for QIAseq Immune Repertoire RNA Library Kits
- Perform QIAseq Targeted TCR Analysis for QIAseq Targeted RNA-seq Panel for T-cell Receptor
- Perform QIAseq Multimodal Analysis (Illumina) for QIAseq Multimodal HT Panels
- Perform QIAseq Multimodal Analysis with TMB and MSI (Illumina) for QIAseq Pan-Cancer Multimodal HT Panel
- Perform QIAseq RNA Fusion XP Analysis (Illumina) for QIAseq Fusion XP Targeted Panels
- Identify QIAseq DNA Somatic and Germline Variants from Tumor Normal Pair (Illumina) for QIAseq Targeted DNA Panels
Note: This problem did not affect data sets where reads that should be grouped into the same UMI read have the same start positions relative to the target fragment. It also did not affect grouping of reads into UMI reads in RNA-Seq analyses, as this does not make use of Create UMI Reads from Reads.