Documents for GSA

Items in each metadata object

The list of items in each metadata object (Version 2.1), with a detailed description of every data item, is freely available here.



GSA Submission Quick Start Guide

The GSA Submission Quick Start Guide (Version 2.1), which describes the submission procedure, is freely available: CN, US.



Tutorial

GSA Data Model

Designed for compatibility, the Genome Sequence Archive (GSA) follows the data standards and structures of the International Nucleotide Sequence Database Collaboration (INSDC). The organizational framework of GSA data is based on four concepts: BIOPROJECT (corresponds to PROJECT in the BioProject database), BIOSAMPLE (corresponds to SAMPLE in the BioSample database), EXPERIMENT, and RUN.

Figure 1. Data model in GSA
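
The four concepts above can be pictured as nested records. The following is only an illustrative sketch in Python, not GSA's actual schema: the class names, field names, and the strict nesting (project, then samples, then experiments, then runs) are simplifying assumptions, and the strain and file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """One RUN: the sequence data files that belong together."""
    files: List[str]


@dataclass
class Experiment:
    """One EXPERIMENT: a sequencing result for a specific sample."""
    title: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class BioSample:
    """One BIOSAMPLE together with the EXPERIMENTs performed on it."""
    name: str
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class BioProject:
    """One BIOPROJECT grouping its BIOSAMPLEs."""
    title: str
    samples: List[BioSample] = field(default_factory=list)


def count_runs(project: BioProject) -> int:
    """Count every RUN reachable from a BIOPROJECT."""
    return sum(len(e.runs) for s in project.samples for e in s.experiments)


# Hypothetical example: three strains, each with one paired-end RUN.
project = BioProject(
    title="Comparative genome sequencing of three strains",
    samples=[
        BioSample(
            name=f"strain_{i}",
            experiments=[
                Experiment(
                    title=f"strain_{i} genome sequencing",
                    runs=[Run(files=[f"strain_{i}_F.fq.gz", f"strain_{i}_R.fq.gz"])],
                )
            ],
        )
        for i in (1, 2, 3)
    ],
)
print(count_runs(project))
```

Each Run in the sketch holds both files of a pair, which mirrors the rule that paired-end files are listed together in the same RUN.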



Organization of metadata objects

The following are examples of how metadata can be organized. Submitters can organize metadata objects flexibly.

♦   Comparative genome sequencing of three strains (paired-end): include both paired-end read files in one Run (Figure 2).

Figure 2. Comparative genome sequencing of three strains (paired-end)



♦   Technical and biological replicates.

Figure 3. Technical and biological replicates



Data submission and retrieval

To create a submission, users need to register and log in to the BIG Data Center Submission Portal (BIG Sub, http://bigd.big.ac.cn/gsub/). To simplify the submission procedure, GSA provides a user-friendly input wizard for data submission (Figure 4).

♦   All data associated with the same BIOPROJECT should be submitted to a single GSA.

♦   EXPERIMENT and RUN objects contain instrument and library information and are directly associated with sequence data.

♦   Each EXPERIMENT is a unique sequencing result for a specific sample.

♦   Paired-end data files (forward/reverse) must be listed together in the same RUN in order for the two files to be correctly processed as paired-end.

Figure 4. Graphic illustration of data submissions to GSA



Release of linked BioProject/BioSample/GSA

Linked BioProject, BioSample, and GSA data are released as follows (Figure 5). Release of BioProject records does NOT trigger release of the other linked data. Release of BioSample records triggers release of the linked BioProject only; it does NOT trigger release of the referencing GSA data. Release of GSA nucleotide sequence data DOES trigger release of the linked BioProject and BioSample records.

Figure 5. Release of linked BioProject/BioSample/GSA



Release Policies and Disclaimers

1. A date can be set by authors to withhold the release of new submissions for a specified period.

2. The release date can be changed through the BIG Sub portal: http://bigd.big.ac.cn/gsub/submit/gsa/[substitute your GSA accession number]/contents

3. If a paper citing the sequence or accession number is published prior to the specified date, the sequence will be released upon publication. Otherwise, GSA will release sequence data on the specified date.

4. As soon as the full publication details are available (all authors, title, journal, volume, pages, and date), please send them to the following address: gsa@big.ac.cn


Frequently Asked Questions

Answers to some of the most frequently asked questions submitted to the GSA are listed as follows.
  1. Introduction
    1. What is GSA?
    2. How can I submit data to GSA?
  2. GSA Accounts
    1. How do I acquire a BIG Sub account?
    2. I have forgotten my BIG Sub username and password.
  3. Data Entry and Transmit
    1. How do I get started?
    2. How do I start a batch submission?
    3. How do I connect to the GSA data by FTP?
    4. What is your data file format?
    5. How do I name the transmitted data files?
    6. What is an MD5 checksum and how do I compute it?
  4. Data release and citation
    1. How do I share my data?
    2. How do I make my data publicly available?
    3. Which accession numbers should be cited in my publication?
  5. Help
    1. Contact information
    2. Collaboration & Visit

  1. Introduction
    1. What is GSA?

      GSA is short for Genome Sequence Archive, a data repository for raw sequencing data from genome, transcriptome, and other omics studies. It archives raw sequence data produced by a wide variety of sequencing platforms. GSA is one of the database resources of the BIG Data Center (BIGD), part of the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and serves as a primary archive of genome sequencing data for institutions and laboratories worldwide.


    2. How can I submit data to GSA?

      Only registered users can submit data, using the BIG Submission Portal (BIG Sub, http://bigd.big.ac.cn/gsub/). Please refer to the GSA Submission Quick Start Guide.


  2. GSA Accounts
    1. How do I acquire a BIG Sub account?

      Any user can register and create a BIG Sub account free of charge. After you submit your registration data, a confirmation email is sent to you automatically so that you can activate your account.


    2. I have forgotten my BIG Sub username and password.

      1) If you have forgotten only your password, you can reset it by clicking “Forgot password”.

      2) After you click the “Send me reset password instructions” button, you will receive an e-mail; please follow the URL in it to reset your password within 30 minutes.

      3) Please log in again with the temporary password first, and then click “Reset Password” to set a new password.

      If you have any problems with your account, please email bigd-admin@big.ac.cn for assistance.


  3. Data Entry and Transmit
    1. How do I get started?

      After logging in, follow the steps below to complete the submission:

      1) Create a new GSA submission in the GSA database.

      2) Register your project (BioProject) and biological samples (BioSamples) if you have not already registered them in the BioProject and BioSample databases, respectively. Please refer to the GSA Submission Quick Start Guide.

      3) Submit GSA metadata, i.e. the information that links your project, samples/experiments, and file names.

      4) Upload sequence data files by FTP.


    2. How do I start a batch submission?

      If the submission contains more than 10 BioSamples, an offline batch submission is preferred. Follow the steps below to complete it:

      1) Enter the GSA database and select batch submission.

      2) Click “Batch Submission”, then use the “Download Excel” button to download the submission template. Fill in the required items and send the completed template by email to gsa@big.ac.cn.


    3. How do I connect to the GSA data by FTP?

      In the current version of GSA, it is highly recommended that you upload your files with a dedicated FTP tool (such as FileZilla Client). Log in to the FTP server and follow the tool's instructions to set the transfer mode. If you are using a command-line FTP client, type the “binary” command before the “mput” command.

      Transmitting your data files to the GSA FTP site

      Address: ftp://submit.big.ac.cn

      User and password: the same as your BIG Sub login

      NOTICE: Navigate (with the cd command) to the GSA folder in the Remote Site box before uploading. Uploaded files will be removed after the whole submission has finished processing.

      After you have finished all of the tasks above, the GSA team will check your information and files and give you feedback.
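
      For scripted uploads, the same steps can be sketched with Python's standard ftplib module. This is a minimal sketch under stated assumptions, not an official GSA client: the remote folder name "GSA" is taken from the notice above (its exact spelling on the server is an assumption), and the function names are hypothetical.

```python
from ftplib import FTP
from pathlib import Path
from typing import Iterable, List

GSA_FTP_HOST = "submit.big.ac.cn"  # host from the address given above
REMOTE_DIR = "GSA"                 # assumed spelling of the remote GSA folder


def upload_files(ftp, paths: Iterable[str], remote_dir: str = REMOTE_DIR) -> List[str]:
    """Upload files in binary mode over an already logged-in FTP connection."""
    ftp.cwd(remote_dir)  # same effect as `cd GSA` in a command-line client
    uploaded = []
    for path in map(Path, paths):
        with path.open("rb") as handle:
            # storbinary always transfers in binary mode, like typing `binary`.
            ftp.storbinary(f"STOR {path.name}", handle)
        uploaded.append(path.name)
    return uploaded


def main(user: str, password: str, paths: Iterable[str]) -> None:
    """Log in with BIG Sub credentials and upload the given files."""
    with FTP(GSA_FTP_HOST) as ftp:
        ftp.login(user=user, passwd=password)
        print(upload_files(ftp, paths))
```

      Calling main("your_user", "your_password", ["DRT_10107_F.clean.fq.gz"]) would attempt a real upload; nothing is contacted until main is called.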


    4. What is your data file format?

      In the current version, we recommend submitting read data in either FASTQ or BAM format. GSA accepts only the GZIP and BZIP2 compression formats (it does NOT accept 7-ZIP, RAR, or TAR). In addition, GSA does not accept multiplexed data.
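
      Before uploading, file names can be screened against the compression rule above. The following is a rough sketch based only on file extensions; the extension lists and the decision to reject gzipped TAR archives (.tar.gz) are assumptions, and the function name is hypothetical. It does not inspect file contents.

```python
from pathlib import Path

ACCEPTED_COMPRESSION = {".gz", ".bz2"}       # GZIP and BZIP2
REJECTED_ARCHIVES = {".7z", ".rar", ".tar"}  # 7-ZIP, RAR, and TAR


def check_compression(filename: str) -> str:
    """Classify a file name by its extensions only (contents are not read)."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    if any(s in REJECTED_ARCHIVES for s in suffixes):
        return "rejected"
    if suffixes and suffixes[-1] in ACCEPTED_COMPRESSION:
        return "accepted"
    return "uncompressed or unknown"


for name in ["DRT_10107_1.clean.fq.gz", "reads.fq.bz2", "reads.tar.gz", "reads.rar"]:
    print(name, "->", check_compression(name))
```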


    5. How do I name the transmitted data files?

      Data files are submitted in FASTQ format, listed in a RUN, and merged into one or two sequence archive files (please do not exceed 10 GB). Single reads (Fragment) must be submitted as a single archive file, which can be named with a suffix appended, e.g. DRT_10107_1.clean.fq.gz. Paired-end data files (forward/reverse), by contrast, MUST be listed in order in a single RUN. For example, forward and reverse read files alternate in the listing and are named with “F” and “R” appended, respectively, e.g. DRT_10107_F.clean.fq.gz; DRT_10107_R.clean.fq.gz.
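
      The naming convention above can be checked automatically before files are listed in RUNs. The following sketch assumes that every file name has the form prefix_TAG.rest, where TAG is “F”, “R”, or “1”, as in the examples; the pattern and the function name are hypothetical, and real file names may need a different rule.

```python
import re
from collections import defaultdict

# Assumed pattern: <prefix>_<F|R|1>.<rest>, e.g. DRT_10107_F.clean.fq.gz
NAME_PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<tag>[FR1])\.(?P<rest>.+)$")


def group_into_runs(filenames):
    """Group files into RUNs: [forward, reverse] pairs or single-read files."""
    by_key = defaultdict(dict)
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unrecognised file name: {name}")
        key = (match["prefix"], match["rest"])
        by_key[key][match["tag"]] = name

    runs = []
    for key in sorted(by_key):
        tags = by_key[key]
        if set(tags) == {"F", "R"}:
            runs.append([tags["F"], tags["R"]])  # forward first, then reverse
        elif set(tags) == {"1"}:
            runs.append([tags["1"]])  # single reads: one file per RUN
        else:
            raise ValueError(f"unpaired or ambiguous files for {key[0]}")
    return runs


print(group_into_runs(["DRT_10107_R.clean.fq.gz", "DRT_10107_F.clean.fq.gz"]))
```

      A forward file without its reverse partner raises an error, which catches the most common mistake before upload.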


    6. What is an MD5 checksum and how do I compute it?

      MD5 checksums are used to verify the integrity of transmitted data. An MD5 checksum is a 32-character hexadecimal string like "e3b5dd475c449300dd11f258538ff494".

      ♦  For Linux users, use: $ md5sum <filename>

      ♦  For Mac users, use: $ md5 <filename>

      ♦  Windows users need to use a third-party tool, e.g.  winmd5free.
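
      On any platform with Python installed, the checksum can also be computed with the standard hashlib module. This is a minimal sketch; the function name and the chunk size are arbitrary choices, and the file is read in chunks so that large archives do not have to fit in memory.

```python
import hashlib
import sys


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 checksum of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Print one "checksum  filename" line per file given on the command line.
    for file_path in sys.argv[1:]:
        print(f"{md5_of_file(file_path)}  {file_path}")
```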


  4. Data release and citation
    1. How do I share my data?

      After accessing the GSA database through your BIG Sub account, find the “Share” button in the last column (“Operation”) of the list, as shown below.

      Click the “Share” tab to obtain the “Shared URL”, as shown in the figure below. You can copy the URL and send it to editors so that they can access your data for peer review.


    2. How do I make my data publicly available?

      After the article is published, click the "Release Now" button in the last column (“Operation”) of the list, as shown below.

      Click "Yes" in the confirmation box to trigger the GSA release. Releasing GSA data also triggers the release of the linked BioProject and BioSample, so you do NOT need to release them separately in their respective systems.

      NOTICE: Data can be searched and downloaded in the GSA database as soon as they are archived.


    3. Which accession numbers should be cited in my publication?

      Once you have successfully submitted data to GSA, please consider using the following wording to describe the data deposition in your manuscript:

      The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Genomics, Proteomics & Bioinformatics 2017) in BIG Data Center (Nucleic Acids Res 2018), Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under accession numbers CRAxxxxxx, CRAyyyyyy that are publicly accessible at  http://bigd.big.ac.cn/gsa.

      ♦  GSA: Genome Sequence Archive. Genomics, Proteomics & Bioinformatics 2017. [PMID=28387199]

      ♦  Database Resources of the BIG Data Center in 2018. Nucleic Acids Res 2018. [PMID=29036542]


  5. Help
    1. Contact information

      If you have any questions, suggestions, or comments, or would like to report a bug, please feel free to contact us by email (gsa@big.ac.cn) or instant messaging (QQ Group: 548170081).


    2. Collaboration & Visit

      You are also welcome to visit us to explore possible collaboration or to learn more about GSA.

      Address:

            BIG Data Center

            Beijing Institute of Genomics, Chinese Academy of Sciences

            No.1 Beichen West Road, Chaoyang District

            Beijing 100101, China

            Tel: +86 (10) 8409-7340