Download Pathogen Database returns too few reference sequences

Issue description

Several issues have been identified in the Download Pathogen Database tool. As a result, only a subset of the intended data is downloaded, and the data that is downloaded comes from outdated sources.
This issue affects all searches where “NCBI Pathogen Detection” has been selected.
Depending on the choice made when downloading a pathogen reference database:

  • The database may be incomplete
  • The data it contains may be based on outdated or superseded knowledge

Results of downstream analyses that use affected reference data sets should be interpreted with caution, as they are based on a partial, possibly outdated, dataset. For example, the use of incomplete genomic assemblies could lead to incorrect strain identification, and recent outbreak data may be missing completely.

Recommendations

For retrieving genomic data, we recommend using the Download Custom Microbial Reference Database tool, which is not affected by this issue.

To download genomes from NCBI RefSeq

  • Use the Download Custom Microbial Reference Database tool.
  • Uncheck the “Skip Database Builder” option.
  • In the “Database Builder” table, use the filtering options to limit the view to the organism(s) of interest, e.g. Escherichia coli and Acinetobacter.
  • Use the appropriate selection in the lower left corner, or apply additional filtering criteria to the “In RefSeq” column, then select the organisms of interest and click “Include”.
  • Finally, press the “Download selection” button to start the download.

To download data from NCBI Pathogen Detection

  • After clicking on the “Finish” button, wait for the Database Builder table to appear. Select the relevant data to download, as described in the manual.
  • Proceed to download the data.

The downloaded data can now be annotated using the Create Annotated Sequence List tool by matching on the “Assembly ID” column.
The metadata file from NCBI can be used for this after renaming it so that its suffix is .txt. For example, the metadata file for PDG000000003.1542 should be named “PDG000000003.1542.metadata.txt”.

Note that renaming the column headers in the metadata file, e.g. “asm_level” to “Assembly Level”, will make the results more readable.
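
The renaming steps can also be scripted. The following is a minimal Python sketch, not part of the CLC software. It assumes that the metadata file is tab-separated with a single header line and that it was downloaded with a suffix other than .txt (for example .tsv). The function name prepare_metadata and the HEADER_RENAMES mapping are hypothetical. Only the “asm_level” to “Assembly Level” pair comes from the text above, so any further entries would need to be added by hand.

```python
import shutil
from pathlib import Path

# Hypothetical mapping of original column names to readable names.
# Only the "asm_level" entry is taken from the article; extend as needed.
HEADER_RENAMES = {
    "asm_level": "Assembly Level",
}


def prepare_metadata(source: Path, renames: dict = HEADER_RENAMES) -> Path:
    """Write a copy of a tab-separated metadata file with a .txt suffix
    and renamed column headers. The original file is left unchanged."""
    target = source.with_suffix(".txt")
    if target == source:
        raise ValueError("source already has a .txt suffix; refusing to overwrite it")

    # newline="" keeps the line endings of the data rows exactly as they are.
    with source.open(encoding="utf-8", newline="") as src, \
            target.open("w", encoding="utf-8", newline="") as dst:
        # Assumption: the first line holds the tab-separated column names.
        header = src.readline().rstrip("\r\n").split("\t")
        dst.write("\t".join(renames.get(name, name) for name in header) + "\n")
        # Copy all remaining rows without modification.
        shutil.copyfileobj(src, dst)
    return target


# Example call (the .tsv file name is an assumption):
# prepare_metadata(Path("PDG000000003.1542.metadata.tsv"))
```

Columns that are not listed in the mapping keep their original names, so the column used for matching on “Assembly ID” still has to be identified when running Create Annotated Sequence List.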

Affected software

  • CLC Microbial Genomics Module 1.5 through 21.1

This issue was addressed in CLC Microbial Genomics Module (MGM) 21.1.1.
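
To check an installed module version against this range, a comparison along the following lines may help. It is a rough Python sketch under two assumptions: version strings consist only of dot-separated numbers, and a two-part version such as 21.1 is equivalent to 21.1.0. The function names parse_version and is_affected are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version string into a three-part tuple of integers,
    padding missing parts with zeros (assumption: "21.1" means 21.1.0)."""
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


# Range stated in this notice: 1.5 through 21.1 (fixed in 21.1.1).
FIRST_AFFECTED = parse_version("1.5")
LAST_AFFECTED = parse_version("21.1")


def is_affected(version: str) -> bool:
    """Return True if the given module version lies in the affected range."""
    return FIRST_AFFECTED <= parse_version(version) <= LAST_AFFECTED


print(is_affected("21.1"))    # last affected release
print(is_affected("21.1.1"))  # release containing the fix
```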
