ARCHAEA TERATA Mac OS

ARCHAEA TERATA Mac OS

June 01 2021

ARCHAEA TERATA Mac OS

  1. Archaea Terata Mac Os Download
  2. Archaea Terata Mac Os 11

The graphical user interface (GUI) is a native Microsoft Windows application that can also be used on Mac OS X. The command line M ega is available as native applications for Windows, Linux,. The Microbe Directory is a collective research effort to profile and annotate more than 7,500 unique microbial species from the MetaPhlAn2 database that includes bacteria, archaea, viruses, fungi, and protozoa. By collecting and summarizing data on various microbes’ characteristics, the project comprises a database that can be used downstream of large-scale metagenomic taxonomic analyses. Microbiology is the study of the microbes, that is, of the viruses, prokaryotes (bacteria and archaea), and the essentially one-celled eukaryotes (algae, protozoa, fungi). Study of the multi-cellular parasitic worms of humans and other animals is often included, as well. Moreover, the cell biology of higher plants and animals is also a relevant.

Purpose in short: I want to detect (and remove) potential contaminants in the genome assembly, and Kraken is a tool designed for that.

Introduction

From its webpage:

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Previous attempts by other bioinformatics software to accomplish this task have often used sequence alignment or machine learning techniques that were quite slow, leading to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm.
In its fastest mode of operation, for a simulated metagenome of 100 bp reads, Kraken processed over 4 million reads per minute on a single core, over 900 times faster than Megablast and over 11 times faster than the abundance estimation program MetaPhlAn. Kraken’s accuracy is comparable with Megablast, with slightly lower sensitivity and very high precision.
Kraken is written in C++ and Perl, and is designed for use with the Linux operating system. We have also successfully compiled and run it under the Mac OS.

In practice

See Kraken manual for full instructions.

Install

Download the latest release.

If you want to build your own dabase, jellyfishversion 1 should be in your PATH.

Build standard Kraken database

The standard Kraken database includes bacterial, archeal, and viral genomes in Refseq at the time of the build.

If any step (including the initial downloads) fails, the build process will abort. However, kraken-build will produce checkpoints throughout the installation process, and will restart the build at the last incomplete step if you attempt to run the same command again on a partially-built database.

Build custom Kraken database

Because the standard Kraken database doesen’t include fungi and protozoa, which I want to include in my analysis.

I found refseq2kraken, which facilitates the downloading and preparation of Kraken database.

In the following part of this note, this library will be refered as to ‘Non-Masked library’.

This post Download refseq-genomic data and prepare it for Kraken is also helpful.

Mask low-complexity regions

I noticed a pull request of kraken: Fixed human genome downloading and added auto-masking feature using dustmasker, and realized that I should mask the library! So I re-generated the kraken library (I don’t know why the author of kraken didn’t mention that in their docs.)

Here is the new script:

In the following part, this library will be refered as to ‘Masked library’.

Archaea Terata Mac Os Download

Classify contigs

Non-Masked library

The ‘classified’ sequences will be saved in ‘classified.fa’, and the ‘unclassified.fa’ will be the ‘clean’ one, which can be used for downstream analysis.

ARCHAEA TERATA Mac OS

But when I looked over the ‘report’, the result quite disappointed me.

This means about 70% contigs were contaminated!

I was curious about the percentage of contaminant lengths of each read. Then I used a simple Python script to count that.

Then I used R to visualize the relationship between contig lengths and the contaminated lengths. There were 14846 contigs, among which 12004 (80.86%) contigs, 2788 (18.78%) contigs and 54 (0.36%) contigs has a contaminated rate of blow 1%, between 1% and 10%, and more than 10% respectively.

Masked library

Archaea Terata Mac Os 11

And about 17% of contigs were classified as ‘contaminant’.

There were 14846 contigs, among which 14766 (99.46%) contigs, 61 (0.41%) contigs and 19 (0.13%) contigs has a contaminated rate of blow 1%, between 1% and 10%, and more than 10% respectively.

Replace contaminant regions with N

Archaea terata mac os catalina

My collaborator asked me to give him an assembly draft for meta-genomic study, which a ‘clean’ one was needed. Removing all the contaminated sequences was not feasible, since there would be few contigs left.

Then I checked the output of Kraken, which contained the detailed information of each contig:

Because Kraken uses a k-mer mapping strategy to locate potential contaminants, there would be many contigs within which only a small part could be aligned to bacteria/virul sequences. So I retrieved the precise k-mers that mapped to contamination, and masked them with ‘N’. I discussed this with another guy: Questions about de novo genome assembly from mixed DNA samples, and he also thought it’s a vivid approach.

Finally

But I began to think how the contaminanted sequences would affect the assembly process, and maybe removing contaminants from raw data is a good way. See discussions here: Filtering for contamination when assembling a genome, before or after assemby? and Question: How to remove contamination from the transcriptome assembly.

Useful links

Change log

  • 20180424: create the note.
  • 20180620: add the “refseq2kraken” part.
  • 20180808: add “Mask low-complexity regions” and respective parts.

ARCHAEA TERATA Mac OS

Leave a Reply

Cancel reply