Latest improvements for QIAGEN CLC Genomics Workbench
QIAGEN CLC Genomics Workbench 25.0
Release date: 2024-12-03
New tools and workflows
Long read handling
New tools and template workflows for analyzing long, next-generation sequencing reads:
- Import Oxford Nanopore Reads
- Map Long Reads to Reference
- Structural Variant Caller for Long Reads
- RNA-Seq Analysis for Long Reads
- De Novo Assemble Long Reads
- Polish Contigs with Reads
- De Novo Assemble Long Reads and Polish with Short Reads template workflow
The tools listed above were formerly distributed in the Long Read Support plugin. Improvements and bug fixes made relative to earlier releases:
- De Novo Assemble Long Reads
- Hifiasm has been updated to version 0.19.9. This third-party de novo assembler is used for assembling PacBio HiFi reads. Results may consequently differ compared with results from earlier versions.
- Raven has been updated to version 1.8.3. This third-party de novo assembler is used for assembling Oxford Nanopore or PacBio non-HiFi reads. Changes to polishing within the tool now lead to references with fewer insertions when compared to the reads.
- In the assembly graph output, hovering the mouse cursor over a contig reveals a tooltip containing the contig's name and length.
- Map Long Reads to Reference
- Reads that wrap around a chromosome more than once are now counted as Unmapped reads and reported as such in the report. Previously, these reads were silently ignored.
- In the stand-alone read mapping output, individual read mappings are now assigned the suffix '_mapping', instead of ' mapping'. This change improves compatibility with downstream tools.
- minimap2 has been upgraded to version 2.28. This third-party mapper is used by the following tools: Map Long Reads to Reference and RNA-Seq Analysis for Long Reads. Results may thus differ compared with results from earlier versions.
- For RNA-Seq Analysis for Long Reads, the log file includes the number of reads that wrap around a chromosome more than once. As in earlier versions, these reads count towards expression but are not included in the reads track output.
- Polish with Reads has been renamed to Polish Contigs with Reads to clarify its intended purpose.
- The template workflow De Novo Assemble and Polish with Short Reads has been improved:
- Additional options are configurable when the workflow is launched.
- Reports from each analysis tool in the workflow are now produced as outputs. Previously, parts of these reports were included in a summary Sample Report, but the individual reports were not available to save.
- The summary Sample Report has been updated. The order of the sections has been improved and some sections from the QC for Sequencing report are no longer included. These are available in the QC report that is now saved as an output.
- The naming patterns for outputs have been updated.
- Fixed an issue causing Structural Variant Caller for Long Reads to fail when a stand-alone read mapping was provided as input.
Other new functionality
- New template workflow: Trim and Map Sanger Sequences adds trim annotations to sequences and then maps them to a reference.
- New workflow control flow elements:
- Fork: Including Fork elements in a workflow lets you choose, when the workflow is launched, which downstream analysis paths to follow, including whether a particular part of an analysis should be run at all.
- Save On-the-Fly Imports: Data imported on-the-fly can be saved directly after import by connecting a Save On-the-Fly Imports element to an Input element.
- Collapse Overlapping Annotations collapses overlapping annotations in an annotation track into a single annotation.
- Resize Annotations allows adjustment of the 5' and/or 3' positions of annotations in an annotation track.
- Remove Information from Track refines annotation, expression, statistical comparison, and variant tracks by keeping or removing selected information.
- Create Report from Table creates a report based on the content of the table view of a data element. The information to include in the report is configurable. This tool was formerly distributed in the CLC Microbial Genomics Module.
- Create Sample Level Heat Map for RNA-Seq creates a heat map of sample distances from RNA-seq data.
- Import Expression Data imports RNA-seq expression values from Excel, CSV or TSV files. This tool was formerly distributed in the Ingenuity Pathway Analysis plugin.
- Files in AWS S3 buckets can be browsed using the Remote Files tab, available to the right of the Navigation Area tab. This tab was previously available only if the CLC Cloud Module was installed.
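The idea behind Collapse Overlapping Annotations, listed above, can be illustrated with a minimal interval-merging sketch. This is not the tool's implementation. The sketch assumes inclusive (start, end) coordinates and ignores strand, annotation type and names. The function name `collapse_overlapping` is hypothetical.

```python
from typing import List, Tuple

# (start, end) with inclusive coordinates; an assumption made for this sketch.
Interval = Tuple[int, int]


def collapse_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge intervals that share at least one position into single intervals."""
    collapsed: List[Interval] = []
    for start, end in sorted(intervals):
        if collapsed and start <= collapsed[-1][1]:
            # Overlaps the previously kept interval: extend it rather than add a new one.
            prev_start, prev_end = collapsed[-1]
            collapsed[-1] = (prev_start, max(prev_end, end))
        else:
            collapsed.append((start, end))
    return collapsed


if __name__ == "__main__":
    print(collapse_overlapping([(1, 10), (5, 20), (30, 40)]))
```

Sorting first means each interval only needs to be compared with the most recently kept one.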
Improvements
Detect and Refine Fusion Genes
Updates to this tool are likely to result in differences in the identified fusions compared to results from earlier versions.
- New options allow filtering detected fusions using lists of genes and/or fusions. This can be useful for removing known false positives, or for restricting fusion detection to genes or fusions of interest.
- Fusions are now supported by fusion spanning reads in addition to fusion crossing reads. Fusion spanning reads are included in the p-value and Z-score calculations.
- All fusions identified in the detection step can be included in the fusion WT track output. This can be useful for:
- Investigating why certain fusions have been filtered away before the refinement step.
- Identifying possible fusions based on paired reads that are mapped as broken pairs, where the breakpoint location could not be determined.
- Exon skippings and novel exon boundaries are no longer detected by default.
- Options have been renamed and reorganized in various wizard steps to better reflect their functionality.
- The following options have been removed:
- Ignore fusions of overlapping genes on opposite strands. We recommend using the new filter feature.
- Only use fusion primer reads. We recommend filtering the reads using Filter on Custom Criteria prior to running Detect and Refine Fusion Genes.
- Maximum distance for broken pair fusions. This option had no impact on results.
- The naming patterns for the tool's outputs have been updated.
- The table view of fusion tracks has been improved:
- It includes an "IPA gene view" column containing links out to QIAGEN Ingenuity Pathway Analysis when these are available, providing additional information about the fusions.
- The columns have been renamed and rearranged.
- The default view has the most typically used columns visible, with other columns initially hidden.
- The Gene column has been removed.
- In the report, the "Discarded base breakpoints" column in the "Unaligned Ends" table has been renamed to "Discarded breakpoints".
Usability
- Tools and workflows are now provided under separate menus: Tools and Workflows, respectively. These menus are available at the top of the Workbench and also in the Toolbox panel at the bottom left of the Workbench. The Workflows menu contains separate subfolders for workflows installed on the CLC Workbench, workflows installed on a CLC Server, and template workflows. Previously, tools and workflows were listed under a single menu called Toolbox, and workflows installed on a CLC Workbench or CLC Server were placed in the same subfolder.
- The order and naming of the buttons in the Workbench toolbar have been updated.
- Multiple sequence alignments
- A Positional stats palette has been added to the Side Panel. Hovering the mouse cursor over a position in the alignment updates information in this palette with the frequency of bases or peptides at that position.
- An option called Alignment on top is available in the Sequence layout Side Panel palette. Enabling this option moves the aligned sequences above other information, such as the consensus, sequence logo, etc.
- A Table view has been added, containing information about the individual sequences in the alignment.
- Selecting bases in one or more sequences selects those positions for all sequences in the alignment.
- Translation table information is provided in a tooltip when the mouse cursor is hovered over a particular genetic code in the launch wizard of the Translate to Protein and Find Open Reading Frames tools or in the Translate CDS/ORF... tool that is available from a right-click menu on CDS and ORF annotations.
- When working with tables, the keyboard shortcut Ctrl + F (mac: Cmd + F) jumps the cursor into the simple filter field, when that field is visible.
- When editing workflows, the keyboard shortcut Ctrl + F (mac: Cmd + F) sets the focus on the Find search field in the Side Panel.
- Information about data available under the Download Genomes tab of the Reference Data Manager, such as its name and version, can be selected and copied.
Performance
- The runtimes of Copy Number Variant Detection (Targeted), QC for Targeted Sequencing, QC for Read Mapping and QC for Sequencing Reads have been substantially improved.
- The speed of searching and filtering variant tracks has been improved.
- The speed of Annotate with Exon Numbers, Annotate with Overlap Information, and Filter Based on Overlap has been improved.
- Maximum Likelihood Phylogeny uses less memory for large alignments than previously. Memory savings are greatest for alignments of a few long sequences.
Import
- The Tracks importer for GFF3 format files:
- Disallows mismatched chromosome lengths between the file and the provided reference.
- Supports all chromosome aliases as defined by UCSC.
- Uses the Sequence Ontology version 2024-06-05 for identifying gene-like and transcript-like annotation types.
- Includes pseudogenes in the (Gene) aggregated track.
- VCF import
- Supports all chromosome aliases as defined by UCSC.
- DUP:TANDEM symbolic alleles are imported as insertions in the variant track output.
- Three FASTQ header formats containing UMI information are now supported.
- When importing GenBank format files using Standard Import, ncRNA and rRNA annotations are named using information from one of the following qualifiers, considered in this order: "gene", "locus_tag", "product", "protein_id", "transcript_id", "note". Previously, they were always named using information from the "note" qualifier.
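The qualifier order described above for naming ncRNA and rRNA annotations amounts to a first-match lookup. The following is a minimal sketch of that logic, not the importer's code. It assumes qualifiers are available as a plain dictionary, and the name `annotation_name` is hypothetical.

```python
from typing import Dict, Optional

# Qualifier names in the order given in the release note above.
QUALIFIER_PRIORITY = ("gene", "locus_tag", "product", "protein_id", "transcript_id", "note")


def annotation_name(qualifiers: Dict[str, str]) -> Optional[str]:
    """Return the value of the first non-empty qualifier in priority order."""
    for key in QUALIFIER_PRIORITY:
        value = qualifiers.get(key)
        if value:
            return value
    # None of the listed qualifiers is present; what the importer does then is not stated.
    return None


if __name__ == "__main__":
    print(annotation_name({"note": "predicted by tool X", "product": "16S ribosomal RNA"}))
```

In this example "product" wins over "note" because it comes earlier in the priority order.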
Workflow
- The following tools can now be included in workflows:
- The workflow launch wizard step where Reference Data Sets can be selected, "Select reference data set", has been improved.
- The stand-alone read mapping output from Assemble Sequences to Reference can be used as input to Fixed Ploidy Variant Detection, Low Frequency Variant Detection and Basic Variant Detection in workflows.
- Annotate with Exon Numbers, Annotate with Overlap Information, and Filter Based on Overlap have a single output channel, replacing the four track-type specific output channels.
Reports
- Combine Reports supports:
- Reordering the sections in the "Set order" wizard step when the inputs are sample or combined reports.
- Preserving the order defined in the "Set order" wizard step in the "Quality control" section from sample reports.
- Excluding the sample reports "Quality control" section in the "Set contents" wizard step.
- Combine Reports and Create Sample Report include options for including the number and percentage of mapped and unmapped bases from Map Reads to Reference reports.
- The JSON exporter includes the passed/uncertain/failed status of quality conditions from sample and combined reports.
- QC for Sequencing Reads reports the percentage of reads with average quality higher than 20, 25, 30 and 35. The reported values can be used as QC thresholds in Create Sample Report.
- In the results report from Copy Number Variant Detection (Targeted), the genome and chromosome plots have been improved, including:
- Updates to axis labels.
- An improved color scheme.
- Coloring CNVs according to whether they are gains (red) or losses (blue).
- Values with decimals in the Trim Reads report are now always reported to 2 decimal places.
- The shade of red used in report plots has been adjusted to make it easier to discern from other colors.
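The QC for Sequencing Reads metric mentioned above (percentage of reads with average quality higher than 20, 25, 30 and 35) can be sketched as follows. This assumes "average quality" means the arithmetic mean of a read's per-base Phred scores, which the release note does not state. The function name `percent_reads_above` is hypothetical.

```python
from typing import Dict, Iterable, Sequence

# Thresholds named in the release note above.
THRESHOLDS = (20, 25, 30, 35)


def percent_reads_above(read_qualities: Iterable[Sequence[int]]) -> Dict[int, float]:
    """Percentage of reads whose mean Phred score is strictly above each threshold."""
    # Empty reads have no defined mean and are skipped (an assumption of this sketch).
    means = [sum(q) / len(q) for q in read_qualities if len(q) > 0]
    result: Dict[int, float] = {}
    for threshold in THRESHOLDS:
        passing = sum(1 for mean in means if mean > threshold)
        result[threshold] = 100.0 * passing / len(means) if means else 0.0
    return result


if __name__ == "__main__":
    print(percent_reads_above([[40, 40], [22, 22], [10, 10], [31, 33]]))
```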
Other new features and improvements
- Annotate with Repeat and Homopolymer Information
- The repeat and homopolymer detection has been improved. This may lead to differences in results compared to those from earlier versions.
- The reference sequence is tested for homopolymers and repeats both 5' and 3' of a variant. Previously, the reference sequence was only tested on the 3' side of a variant. When different homopolymers are found on both sides of a variant, information about the longest of these is kept. The same is true for repeats detected on both sides of a variant. When a homopolymer is found on one side of a variant and a repeat on the other, information about both is retained.
- Length and sequence information about homopolymers and repeats is added to annotations in the variant track output.
- The maximum number of mismatches allowed in a homopolymer/repeat can now be specified when launching the tool.
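The "check both sides, keep the longest" behaviour described above for homopolymers can be sketched as follows. This is a simplified illustration, not the tool's algorithm. It assumes a linear reference and a single-base variant at a 0-based position, and it allows no mismatches. The `min_length` parameter and both function names are hypothetical.

```python
from typing import Optional, Tuple


def run_length(seq: str, start: int, step: int) -> int:
    """Length of the run of identical bases beginning at `start`, walking in direction `step`."""
    if not 0 <= start < len(seq):
        return 0
    base = seq[start]
    length = 0
    pos = start
    while 0 <= pos < len(seq) and seq[pos] == base:
        length += 1
        pos += step
    return length


def longest_flanking_homopolymer(
    reference: str, variant_pos: int, min_length: int = 3
) -> Optional[Tuple[str, int]]:
    """Return (base, length) of the longest homopolymer directly flanking the variant, or None."""
    candidates = []
    # Walk left from the base before the variant, then right from the base after it.
    for start, step in ((variant_pos - 1, -1), (variant_pos + 1, 1)):
        length = run_length(reference, start, step)
        if length >= min_length:
            candidates.append((reference[start], length))
    return max(candidates, key=lambda c: c[1], default=None)


if __name__ == "__main__":
    # The variant is the C at index 5, flanked by AAAA on one side and TTTTTT on the other.
    print(longest_flanking_homopolymer("GAAAACTTTTTTG", 5))
```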
- QC for Targeted Sequencing
- Mappings containing long reads are processed efficiently.
- Information about coverage from broken pairs and non-specific reads is included in the per-region statistics track.
- Both mean and median coverage are included in the gene coverage track. Previously, only the mean was reported, but it was mislabelled as the median (see the Bug fixes section below).
- QC for Sequencing Reads can efficiently process long reads.
- Filter Based on Overlap includes new options for keeping or removing parts of annotations in an annotation track that are also present in other tracks. Existing options have been renamed to better reflect their functionality.
- Filter on Custom Criteria accepts Sequence Lists as input.
- Merge Annotation Tracks can merge tracks containing similar types of annotations, such as various gene or RNA types.
- Create Consensus Sequences from Variants
- The relevant IUPAC ambiguity code is inserted instead of N in the consensus sequence when multiple SNVs are present in the same position.
- Only the most frequent insertion in the consensus sequence is included for overlapping insertions. Previously, all the overlapping insertions would be added consecutively.
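The IUPAC substitution described above can be illustrated with a small lookup from a set of bases to its standard IUPAC nucleotide code. This is a sketch only. Which alleles the tool includes in the set (for example, whether the reference base is counted) is not stated in these notes, so the caller decides here. The function name `ambiguity_code` is hypothetical.

```python
from typing import Iterable

# Standard IUPAC nucleotide codes and the bases each one represents.
_CODE_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
_BASES_TO_CODE = {frozenset(bases): code for code, bases in _CODE_TO_BASES.items()}


def ambiguity_code(bases: Iterable[str]) -> str:
    """Return the IUPAC code representing exactly the given set of bases."""
    key = frozenset(base.upper() for base in bases)
    return _BASES_TO_CODE[key]


if __name__ == "__main__":
    # Two SNVs at one position, A and G, are represented by a single code.
    print(ambiguity_code(["A", "G"]))
```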
- The Motif Search tool
- Regular expressions that include the character "?" are now supported, allowing for lookahead expressions.
- The table output includes name and motif columns when motif lists are used as input, and a motif column when a single sequence is analyzed.
- A report can be output that shows the number of matches for the searched motifs.
- Multiple sequence alignments can be provided as input.
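Lookahead expressions, mentioned above for Motif Search, are useful for finding overlapping motif occurrences. The sketch below uses Python's `re` module purely for illustration. The release notes do not say which regular expression dialect the tool uses, so the syntax the tool accepts may differ. The function name `overlapping_matches` is hypothetical.

```python
import re
from typing import List


def overlapping_matches(motif: str, sequence: str) -> List[int]:
    """Return 0-based start positions of all motif matches, including overlapping ones."""
    # (?=...) is a zero-width lookahead: it matches at a position without consuming
    # characters, so the search can report matches that overlap each other.
    pattern = re.compile(f"(?=({motif}))")
    return [match.start() for match in pattern.finditer(sequence)]


if __name__ == "__main__":
    print(overlapping_matches("ATA", "ATATATA"))
    # For comparison, a plain search skips the occurrence that overlaps the first match.
    print([m.start() for m in re.finditer("ATA", "ATATATA")])
```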
- Reads tracks
- An option has been added to the Track layout Side Panel palette for highlighting unaligned ends.
- Lines indicating insertions with a frequency below the threshold set in the Hide insertions below (%) Side Panel option have been made thinner and are hidden when zoomed out to a level where other types of variants are no longer shown.
- In the Volcano plot view of Statistical Comparison Table and Tracks:
- The default color for down-regulated features is now blue, and the default color for up-regulated features is red. Previously the default was red for down-regulated features and blue for up-regulated features.
- The legend position can be adjusted.
- Expression tracks, statistical comparison tracks and variant tracks can be searched using functionality in the Find palette of the Side Panel. Previously this functionality was only available for annotation tracks.
- The table view of annotation tracks now includes the annotation type.
- Heat map elements have a table view containing the underlying values.
- The list of local databases available to search against using BLAST can be filtered and searched in the launch wizard.
- "Regulatory" has been added to the predefined list of annotation types available for nucleotide sequences, and 'regulatory_class' has been added to the predefined list of qualifier/keys available for sequence annotations.
- When exporting a zip archive, empty folders within the exported location are included in the zip archive. When importing zip or tar archives, empty folders within the archives are recreated in the import location.
- New policy property: workbench_save_to_server. When set to ‘deny’, items created by a CLC Workbench cannot be saved to a CLC Server File System Location. This applies, for example, to data generated using a tool or workflow run on a Workbench, data copied from a CLC Workbench Data Location, and folders created using the CLC Workbench. The default is ‘allow’.
- Note: ‘run_on_workbench_when_server_is_available’ will be retired in a future release. Please use the new workbench_save_to_server policy instead.
- Various minor improvements
Bug fixes
- Detect and Refine Fusion Genes
- Fixed an issue where, when multiple possible fusions had been detected for the same pair of genes, some fusions would be incorrectly reported as having no support. See the public notification about this issue.
- Fixed an issue where 0 fusion crossing reads were assigned to fusions with the breakpoint located after the last exon of the highest-priority transcript, or the first transcript if no priorities were available.
- Fixed an issue causing the tool to fail if the input mRNA track did not contain any features of type mRNA.
- QC for Targeted Sequencing
- Fixed an issue where the mean coverage in the gene coverage track was labelled as the median coverage.
- Fixed an issue affecting the coverage report and the per-region statistics track, where coverage on insertions from reads that ended inside the insertions was ignored.
- Fixed an issue affecting the coverage report, where coverage on insertions was counted twice on overlapping targets.
- Trim Reads
- Fixed an issue that could cause automatic read-through adapter trimming to give different results when input sequence lists were not provided in the same order. This could happen when the read-through sequence was not the same in individual sequence lists. Automatic read-through adapter trimming is now done individually on each input sequence list.
- Fixed an issue where the consensus sequence calculated from all identified read-through sequences was reported instead of the sequence that was actually used for automatic read-through adapter trimming.
- Annotate with Repeat and Homopolymer Information
- Fixed an issue that caused the tool to fail when annotating variants in the second to last position on a chromosome.
- Fixed an issue where annotations were not added to variants located in a homopolymer or repeat region spanning the origin of a circular reference sequence.
Other bug fixes
- Fixed an issue causing the outer ends of very long unaligned ends to not be rendered in reads tracks. This would happen on Oxford Nanopore and PacBio long read mappings in regions where no reads were aligned to the reference.
- Fixed an issue for phylogenetic trees where the horizontal scale bar would change, even though 'Fixed width on zoom' was selected and the zoom only affected the tree vertically.
- Fixed an issue where the SAM/BAM/CRAM Mapping Files and Ultima importers allowed reference synonyms when importing CRAM files, causing the importers to fail. The importers no longer allow reference synonyms for CRAM files.
- Fixed an issue causing VCF import to fail when importing VCF files with symbolic alleles having SVLEN=0. These alleles are now imported to an annotation track and assigned length zero.
- Fixed an issue causing reads wrapping around circular reference sequences more than once to be counted as mapped reads, even though they are actually discarded. These reads are now included in the unmapped reads count in the reports from Map Reads to Reference, Map Bisulfite Reads to Reference, Map Reads to Contigs, and the SAM/BAM/CRAM Mapping Files importer.
- Track-based output from Map Reads to Contigs can no longer be selected when the option to update contigs has been enabled. Previously, a reads track could be produced but the contigs in it were not updated based on the new read mapping information.
- Fixed an issue causing Standard Import of GenBank format files to not support files where the SOURCE or ORGANISM field was immediately followed by the ORIGIN field.
- Fixed an issue affecting the output from Create K-medoids Clustering for RNA-Seq where line graph legends were not displayed when Cluster 1 contained more than 10 genes.
- Fixed an issue causing local BLAST jobs to fail when searching a database located on a Windows file share.
- Fixed an issue with Search for Sequences at NCBI affecting searches using a list of terms separated using OR, commas or spaces. If the list contained one or more terms that could not be found, no results were returned, even for the terms that could be found.
- A small improvement was made affecting Create Alignment, Assemble Sequences, and Assemble Sequences to Reference when lower alignment accuracy options are selected. This change is not expected to affect results for most analyses.
- Fixed an issue where a palette in the Side Panel would disappear if, when moving it, it was placed onto its existing location.
- Fixed an issue where an error arose if the only element in a folder was deleted at the same time that new elements were being dropped into that folder.
- Various bug fixes
Reference data
The following improvements refer to data available via the Reference Data Manager:
- Under the QIAGEN Sets tab:
- Gene Exclude Lists and Fusion Exclude Lists Reference Data Elements have been added. These elements can be used to filter detected fusions in Detect and Refine Fusion Genes:
- Version ensembl_v106.1_hg38_no_alt_analysis_set
- Version refseq_GRCh38.p14_no_alt_analysis_set
- MANE genes, CDS and mRNA Reference Data Elements have been added:
- Version ensembl_mane_v1.3_hg38_no_alt_analysis_set.
- Version refseq_mane_v1.3_hg38_no_alt_analysis_set.
- Reference Data Sets have been added:
- hg38 (Ensembl MANE) containing Ensembl MANE Genes, CDS and mRNA elements.
- hg38 (RefSeq MANE) containing RefSeq MANE Genes, CDS and mRNA elements.
- Under the Download Genomes tab:
- gnomAD exome data is available for the reference sequence “Homo sapiens – hg38_no_alt_analysis_set”.
Changes
Tool and settings
- Create Heat Map for RNA-Seq has been renamed to Create Feature Level Heat Map for RNA-Seq. Option names have been updated for consistency with Create Sample Level Heat Map for RNA-Seq.
- Copy Number Variant Detection (CNVs) has been renamed to Copy Number Variant Detection (Targeted).
- The Core nucleotide BLAST database (core_nt) is the default for blastn and tblastx searches using BLAST at NCBI. Previously the default was Nucleotide collection (nr/nt).
- Create Tree accepts a single alignment as input. Previously it would accept multiple alignments and process each alignment individually. That behaviour is available by checking the Batch checkbox when launching Create Tree.
- The Illumina importer no longer supports .txt files.
- The column Linkage in variant tracks has been removed. That column was always empty.
- The Translocation and Total (Translocation) rows are no longer included in the "Variants table" in the report generated using InDels and Structural Variants. These rows were always empty, as translocations are not called by this tool.
- Tools under the Tools | Utility Tools folder have been re-ordered.
- The option to bypass proxy settings when connecting to a CLC Genomics Server from a CLC Workbench has been moved to the Workbench Preferences. It was previously an option in the CLC Server Connection dialog.
Installation
- When installing CLC Genomics Workbench 25.0 (or higher), plugins installed in a CLC Genomics Workbench from the previous major version line that is already present on the system are copied to the new version. On starting the new version, a dialog opens prompting for these plugins to be updated to versions compatible with the new software. This change makes upgrading CLC Genomics Workbench from an earlier version faster and simpler.
- On macOS, the CLC Genomics Workbench is automatically configured to open .clc files and clc:// URLs. On Windows and Linux, options to configure this behaviour will still be presented when installing the software.
Third party version updates
- The "Sequence Representation" section under the View tab in the Workbench Preferences has been renamed "Sequence Label".
- The Java version bundled with CLC Genomics Workbench 25.0 is Java 21.0.4, where we use the JRE from the Azul OpenJDK builds.
- The hmmsearch programme used in Pfam domain search has been updated to version 3.4.
- Trim Sequences has been updated with build 10.1 of the UniVec database.
- The restriction site database REBASE has been updated to version 408.
- BLAST has been upgraded to BLAST+ 2.15.0. BLAST+ changes can be viewed at http://www.ncbi.nlm.nih.gov/books/NBK131777.
Functionality retirement
The following have been retired from the Workbench:
- Remove Information from Variants. Remove Information from Track has replaced this tool.
- Import Vector NTI Database. Previously delivered by the now retired Vector NTI import plugin.
- The "Sequence Representation" option in the right-click menu in the Navigation Area. Functionality to base the display name of sequences on the accession, Latin name or common name remains available using the "Sequence label" setting in the Sequence layout palette in Side Panel settings, and using the Sequence layout setting under the View tab in the Workbench Preferences.
Plugin Retirements
- Ingenuity Pathway Analysis: Functionality from this plugin is now available via the Biomedical Genomics Analysis plugin.
- Long Read Support: Functionality from this plugin is now available directly in the CLC Genomics Workbench. See information in the New tools and workflows section above.
- Vector NTI import: Import tools provided by this plugin are no longer available.
Advance notice
Legacy tools
The following tools have been moved to the Legacy folder of the Workbench Tools menu and will be retired in a future version of the software:
- Correct Long Reads (legacy). Note that this legacy tool is affected by the upgrade of the third party mapper, minimap2, to version 2.28.
Other legacy functionality
The policy property ‘run_on_workbench_when_server_is_available’ will be retired in a future release. Please use the new workbench_save_to_server policy instead.