Released Genome Sequences

RCoV19

Data Release Statistics

Continent	Country/Region	Genome sequences	Complete genome sequences	Human genome sequences	Human complete genome sequences	Monitoring report

Search

The new search page is publicly available; please try it out.

Because of usage rights, sequences from GISAID cannot be downloaded here. Please log into GISAID's website to retrieve them.

Virus Strain Name	Accession ID	Gender	Age	Data Source	Related ID	Raw Data	Lineage	Nuc.Completeness	Sequence Length	Sequence Quality	Quality Assessment	Host	Sample Collection Date	Location	Originating Lab	Submission Date	Submitting Lab	Create Date	Country/Region	Province	City	Last Update Time

Genome sequences from resources worldwide (NCBI, GISAID, NGDC, NMDC, CNGB) are integrated and curated based on metadata and sequence alignment results. This table is updated daily during the COVID-19 outbreak.

Related ID: We curate the data to provide non-redundant genome sequences, especially whole genome sequences, so that users can obtain correct analysis results such as variation frequencies and phylogenetic trees.
1. Redundancies between databases: A genome sequence may be submitted to more than one database. Redundancies are identified from meta information, sequence alignment, or reports from data submitters. We preferentially show the genome information (“Accession ID” and “Virus Strain Name”) from databases that are publicly open to all users; the accession IDs from the other databases are listed in “Related ID”. We encourage submitters to share data with multiple data centers, which greatly benefits curation, annotation, bioinformatics analysis, and experimental studies.
2. Redundancies within a database: Within a database, genome sequences with the same virus name, sequence, sample collection date, patient information, virus passage history, etc., are considered redundant. This may be due to repeated submission or other mistakes. In this case, the accession ID of the later entry is listed in “Related ID”.
3. Whole genome sequence and gene sequences: If a virus has both a whole genome sequence and gene sequences, only the whole genome sequence is included in the table, and the accession IDs of the gene sequences are not listed in “Related ID”. Gene sequences are listed as separate rows only if the whole genome is unavailable; once the whole genome sequence is obtained, it replaces them in the table.
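
The within-database rule (item 2) can be illustrated with a minimal Python sketch. It is not the curation pipeline itself: the `Record` class and `merge_within_database` function are hypothetical names, and the comparison key is simplified to virus name, sequence, and collection date, whereas the text also mentions patient information and passage history.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Hypothetical, simplified record; real entries carry more metadata.
    accession_id: str
    virus_name: str
    sequence: str
    collection_date: str
    related_ids: list = field(default_factory=list)


def merge_within_database(records):
    """Keep the first record per key and list later accessions as related IDs."""
    kept = {}
    for rec in records:
        # Simplified redundancy key (an assumption for this sketch).
        key = (rec.virus_name, rec.sequence, rec.collection_date)
        if key in kept:
            # A later duplicate: only its accession ID survives, under "Related ID".
            kept[key].related_ids.append(rec.accession_id)
        else:
            kept[key] = rec
    return list(kept.values())


records = [
    Record("A1", "strain-x", "ACGT", "2020-01-01"),
    Record("A2", "strain-x", "ACGT", "2020-01-01"),  # repeated submission
    Record("A3", "strain-y", "ACGA", "2020-01-02"),
]
merged = merge_within_database(records)
```

Here `merged` holds two rows, and the first row lists `A2` as its related ID.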

Download: Meta information can be downloaded by all users, but only sequences that were originally public are available in our database. Because of usage rights, sequences from GISAID cannot be downloaded here. Please log into GISAID's website to retrieve them.

Nuc.Completeness: A “Complete” genome sequence must cover all protein-coding (CDS) regions of SARS-CoV-2, and its length must be greater than 29k. Otherwise, the sequence is labeled “Partial”.
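
As a rough illustration of this rule, the check can be written as a two-condition test. The function name `classify_completeness` is hypothetical, and the CDS coverage is passed in as a precomputed flag because the text does not say how coverage is determined.

```python
MIN_COMPLETE_LENGTH = 29_000  # "greater than 29k", taken from the text


def classify_completeness(sequence_length, covers_all_cds):
    """Return "Complete" only when both conditions from the text hold."""
    # covers_all_cds is assumed to be computed elsewhere (e.g. from an alignment).
    if covers_all_cds and sequence_length > MIN_COMPLETE_LENGTH:
        return "Complete"
    return "Partial"


label_full = classify_completeness(29_903, True)
label_short = classify_completeness(29_000, True)
label_missing_cds = classify_completeness(29_903, False)
```

Only `label_full` is "Complete"; a sequence of exactly 29,000 nt is not greater than 29k and is labeled "Partial".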

Quality Assessment: “Complete” sequences are further assessed on five aspects of sequence quality: the number of unknown bases (Ns), the number of degenerate bases, the total number of gaps (deletions, insertions, indels) when aligned to the reference sequence MN908947, the number of mutations, and the mutation density (mutation number / length of mutation region; mutation region <= 20 nt). The total number of mutations is calculated across the whole genome, while the other four items are analyzed within the protein-coding region.
Quality control is performed based on the criteria listed in the table; green represents “pass” and red represents “fail”. Mouse over to view details.
Sequences tagged with red dot(s) should be used with caution. A large number of Ns or degenerate bases, or multiple gaps, suggests quality issues due to low coverage, low depth, or other technical problems. Users should take note of a high number of mutations or a high mutation density and check whether there are quality issues.
Variation analysis is not performed for sequences that fail quality control for Ns or degenerate bases. In addition, because sequences of non-human viruses always show a large number of mutations when aligned to the reference sequence, we assess only the number of Ns and degenerate bases for their quality control.
In the download table, “Quality Assessment” lists the corresponding number for each item. For the last item, mutation density, YES/NO indicates whether a high mutation density region is present. If no variation analysis is performed, “NA” is provided.

	Unknown Base(s)	Degenerate Base(s)	Total Gap(s)	Total Mutation(s)	Mutation Density
Green	<=15	<=50	<=2 gaps	<=Expected Value+1	<0.25
Red	>15	>50	>2 gaps	>Expected Value+1	>=0.25
Details	Unknown base(s): number	Degenerate base(s): number	Total gap(s): number	Total Mutation(s): number	High mutation density: starting site~ending site (length of mutation region-total mutations-mutation density); "NO" is displayed if there is no high mutation density region

The expected number of mutations is calculated based on the mutation rate, 8.69 × 10^{−4} per site per year. For details, please refer to Liu et al., 2020.
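
The page does not spell out the formula, so the sketch below shows only one plausible reading: expected mutations = mutation rate × genome length × years elapsed since the reference sample. The function names and the `reference_date` value are assumptions made for illustration; 29,903 nt is the length of MN908947.

```python
from datetime import date

MUTATION_RATE = 8.69e-4     # per site per year, from the text
REFERENCE_LENGTH = 29_903   # length of MN908947 in nucleotides


def expected_mutations(collection_date, reference_date,
                       genome_length=REFERENCE_LENGTH):
    """Assumed formula: rate x genome length x elapsed years (never negative)."""
    years = (collection_date - reference_date).days / 365.25
    return MUTATION_RATE * genome_length * max(years, 0.0)


def passes_mutation_check(total_mutations, expected):
    """Green when total mutations <= Expected Value + 1, as in the table."""
    return total_mutations <= expected + 1


# Hypothetical dates, chosen only to exercise the functions.
expected = expected_mutations(date(2020, 12, 31), date(2019, 12, 31))
ok_27 = passes_mutation_check(27, expected)
ok_28 = passes_mutation_check(28, expected)
```

Under these assumptions a sample collected one year after the reference is expected to carry about 26 mutations, so 27 observed mutations would pass and 28 would not.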

Sequence Quality: We consider a sequence to be of high quality if it passes quality control for both Ns and degenerate bases. Otherwise, it is considered to be of low quality.
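
The thresholds in the criteria table and the high/low rule can be combined into a short sketch. The function and key names are hypothetical, the per-item counts are assumed to be computed beforehand, and `max_density` stands for the highest mutation density found in any mutation region.

```python
def quality_flags(n_count, degenerate_count, gap_count,
                  mutation_count, expected_mutations, max_density):
    """Return True (green) or False (red) per item, using the table thresholds."""
    return {
        "unknown_bases": n_count <= 15,
        "degenerate_bases": degenerate_count <= 50,
        "total_gaps": gap_count <= 2,
        "total_mutations": mutation_count <= expected_mutations + 1,
        "mutation_density": max_density < 0.25,
    }


def sequence_quality(flags):
    """High quality requires passing both the Ns and degenerate-base checks."""
    if flags["unknown_bases"] and flags["degenerate_bases"]:
        return "high"
    return "low"


good = quality_flags(3, 0, 1, 10, 12.0, 0.10)
bad = quality_flags(16, 0, 3, 20, 12.0, 0.25)
good_label = sequence_quality(good)
bad_label = sequence_quality(bad)
```

Note that gaps, mutation count, and mutation density affect the red/green dots but not the high/low label, which depends only on Ns and degenerate bases.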
