- Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. [PMID: 26553804]
Nuala A O'Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, Alexander Astashyn, Azat Badretdin, Yiming Bao, Olga Blinkova, Vyacheslav Brover, Vyacheslav Chetvernin, Jinna Choi, Eric Cox, Olga Ermolaeva, Catherine M Farrell, Tamara Goldfarb, Tripti Gupta, Daniel Haft, Eneida Hatcher, Wratko Hlavina, Vinita S Joardar, Vamsi K Kodali, Wenjun Li, Donna Maglott, Patrick Masterson, Kelly M McGarvey, Michael R Murphy, Kathleen O'Neill, Shashikant Pujar, Sanjida H Rangwala, Daniel Rausch, Lillian D Riddick, Conrad Schoch, Andrei Shkeda, Susan S Storz, Hanzhen Sun, Francoise Thibaud-Nissen, Igor Tolstoy, Raymond E Tully, Anjana R Vatsan, Craig Wallin, David Webb, Wendy Wu, Melissa J Landrum, Avi Kimchi, Tatiana Tatusova, Michael DiCuccio, Paul Kitts, Terence D Murphy, Kim D Pruitt
Nucleic acids research 2016:44(D1)
198 Citations (Google Scholar as of 2017-09-07)
Abstract: The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
- RefSeq curation and annotation of antizyme and antizyme inhibitor genes in vertebrates. [PMID: 26170238]
Bhanu Rajput, Terence D Murphy, Kim D Pruitt
Nucleic acids research 2015:43(15)
4 Citations (Google Scholar as of 2017-09-07)
Abstract: Polyamines are ubiquitous cations that are involved in regulating fundamental cellular processes such as cell growth and proliferation; hence, their intracellular concentration is tightly regulated. Antizyme and antizyme inhibitor have a central role in maintaining cellular polyamine levels. Antizyme is unique in that it is expressed via a novel programmed ribosomal frameshifting mechanism. Conventional computational tools are unable to predict a programmed frameshift, resulting in misannotation of antizyme transcripts and proteins on transcript and genomic sequences. Correct annotation of a programmed frameshifting event requires manual evaluation. Our goal was to provide an accurately curated and annotated Reference Sequence (RefSeq) data set of antizyme transcript and protein records across a broad taxonomic scope that would serve as standards for accurate representation of these gene products. As antizyme and antizyme inhibitor proteins are functionally connected, we also curated antizyme inhibitor genes to more fully represent the elegant biology of polyamine regulation. Manual review of genes for three members of the antizyme family and two members of the antizyme inhibitor family in 91 vertebrate organisms resulted in a total of 461 curated RefSeq records. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
- RefSeq: an update on mammalian reference sequences. [PMID: 24259432]
Kim D Pruitt, Garth R Brown, Susan M Hiatt, Françoise Thibaud-Nissen, Alexander Astashyn, Olga Ermolaeva, Catherine M Farrell, Jennifer Hart, Melissa J Landrum, Kelly M McGarvey, Michael R Murphy, Nuala A O'Leary, Shashikant Pujar, Bhanu Rajput, Sanjida H Rangwala, Lillian D Riddick, Andrei Shkeda, Hanzhen Sun, Pamela Tamez, Raymond E Tully, Craig Wallin, David Webb, Janet Weber, Wendy Wu, Michael DiCuccio, Paul Kitts, Donna R Maglott, Terence D Murphy, James M Ostell
Nucleic acids research 2014:42(Database issue)
571 Citations (Google Scholar as of 2017-09-07)
Abstract: The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI's eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI's eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.
- Comparison of RefSeq protein-coding regions in human and vertebrate genomes. [PMID: 24063302]
Jessica H Fong, Terence D Murphy, Kim D Pruitt
BMC genomics 2013:14
8 Citations (Google Scholar as of 2017-09-07)
Abstract: Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI's RefSeq project and subsequently processed by NCBI's eukaryotic annotation pipeline. Genome annotation results are affected by differences in available support evidence and may be impacted by annotation pipeline software changes over time. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol which integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features. We assessed an ortholog dataset that includes 34 annotated vertebrate RefSeq genomes including human. We confirm that RefSeq protein-coding gene annotations in mammals exhibit considerable similarity. Over 50% of the orthologous protein-coding genes in 20 organisms are supported at the level of splicing conservation with at least three selected reference genomes. Approximately 7,500 ortholog sets include at least half of the analyzed organisms, show highly similar sequence and conserved splicing, and may serve as a minimal set of mammalian "core proteins" for initial assessment of new mammalian genomes. Additionally, 80% of the proteins analyzed pass a suite of tests to detect proteins that lack splicing conservation and have unusual sequence or domain annotation. We use these tests to define an annotation quality metric that is based directly on the annotated proteins thus operates independently of other quality metrics such as availability of transcripts or assembly quality measures. Results are available on the RefSeq FTP site [http://ftp.ncbi.nlm.nih.gov/refseq/supplemental/ProtCore/SM1.txt]. Our multi-factored analysis demonstrates a high level of consistency in RefSeq protein representation among vertebrates. We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology are good as measured by these metrics. The process flow described provides specific information on the scope and degree of conservation for the analyzed protein sequences and annotations and will be used to enrich the quality of RefSeq records by identifying targets for further improvement in the computational annotation pipeline, and by flagging specific genes for manual curation.
- NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. [PMID: 22121212]
Kim D Pruitt, Tatiana Tatusova, Garth R Brown, Donna R Maglott
Nucleic acids research 2012:40(Database issue)
800 Citations (Google Scholar as of 2017-09-07)
Abstract: The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,000 organisms, 2.4 × 10(6) genomic records, 13 × 10(6) proteins and 2 × 10(6) RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).
- NCBI Reference Sequences: current status, policy and new initiatives. [PMID: 18927115]
Kim D Pruitt, Tatiana Tatusova, William Klimke, Donna R Maglott
Nucleic acids research 2009:37(Database issue)
721 Citations (Google Scholar as of 2017-09-07)
Abstract: NCBI's Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) is a curated non-redundant collection of sequences representing genomes, transcripts and proteins. RefSeq records integrate information from multiple sources and represent a current description of the sequence, the gene and sequence features. The database includes over 5300 organisms spanning prokaryotes, eukaryotes and viruses, with records for more than 5.5 x 10(6) proteins (RefSeq release 30). Feature annotation is applied by a combination of curation, collaboration, propagation from other sources and computation. We report here on the recent growth of the database, recent changes to feature annotations and record types for eukaryotic (primarily vertebrate) species and policies regarding species inclusion and genome annotation. In addition, we introduce RefSeqGene, a new initiative to support reporting variation data on a stable genomic coordinate system.
- NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. [PMID: 17130148]
Kim D Pruitt, Tatiana Tatusova, Donna R Maglott
Nucleic acids research 2007:35(Database issue)
1911 Citations (Google Scholar as of 2017-09-07)
Abstract: NCBI's reference sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) is a curated non-redundant collection of sequences representing genomes, transcripts and proteins. The database includes 3774 organisms spanning prokaryotes, eukaryotes and viruses, and has records for 2,879,860 proteins (RefSeq release 19). RefSeq records integrate information from multiple sources, when additional data are available from those sources and therefore represent a current description of the sequence and its features. Annotations include coding regions, conserved domains, tRNAs, sequence tagged sites (STS), variation, references, gene and protein product names, and database cross-references. Sequence is reviewed and features are added using a combined approach of collaboration and other input from the scientific community, prediction, propagation from GenBank and curation by NCBI staff. The format of all RefSeq records is validated, and an increasing number of tests are being applied to evaluate the quality of sequence and annotation, especially in the context of complete genomic sequence.
- NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. [PMID: 15608248]
Kim D Pruitt, Tatiana Tatusova, Donna R Maglott
Nucleic acids research 2005:33(Database issue)
1278 Citations (Google Scholar as of 2017-09-07)
Abstract: The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff.
- NCBI Reference Sequence project: update and current status. [PMID: 12519942]
Kim D Pruitt, Tatiana Tatusova, Donna R Maglott
Nucleic acids research 2003:31(1)
193 Citations (Google Scholar as of 2017-09-07)
Abstract: The goal of the NCBI Reference Sequence (RefSeq) project is to provide the single best non-redundant and comprehensive collection of naturally occurring biological molecules, representing the central dogma. Nucleotide and protein sequences are explicitly linked on a residue-by-residue basis in this collection. Ideally all molecule types will be available for each well-studied organism, but the initial database collection pragmatically includes only those molecules and organisms that are most readily identified. Thus different amounts of information are available for different organisms at any given time. Furthermore, for some organisms additional intermediate records are provided when the genome sequence is not yet finished. The collection is supplied by NCBI through three distinct pipelines in addition to collaborations with community groups. The collection is curated on an ongoing basis. Additional information about the NCBI RefSeq project is available at http://www.ncbi.nih.gov/RefSeq/.
- RefSeq and LocusLink: NCBI gene-centered resources. [PMID: 11125071]
K D Pruitt, D R Maglott
Nucleic acids research 2001:29(1)
926 Citations (Google Scholar as of 2017-09-07)
Abstract: Thousands of genes have been painstakingly identified and characterized a few genes at a time. Many thousands more are being predicted by large scale cDNA and genomic sequencing projects, with levels of evidence ranging from supporting mRNA sequence and comparative genomics to computing ab initio models. This, coupled with the burgeoning scientific literature, makes it critical to have a comprehensive directory for genes and reference sequences for key genomes. The NCBI provides two resources, LocusLink and RefSeq, to meet these needs. LocusLink organizes information around genes to generate a central hub for accessing gene-specific information for fruit fly, human, mouse, rat and zebrafish. RefSeq provides reference sequence standards for genomes, transcripts and proteins; human, mouse and rat mRNA RefSeqs, and their corresponding proteins, are discussed here. Together, RefSeq and LocusLink provide a non-redundant view of genes and other loci to support research on genes and gene families, variation, gene expression and genome annotation. Additional information about LocusLink and RefSeq is available at http://www.ncbi.nlm.nih.gov/LocusLink/.
- NCBI's LocusLink and RefSeq. [PMID: 10592200]
D R Maglott, K S Katz, H Sicotte, K D Pruitt
Nucleic acids research 2000:28(1)
184 Citations (Google Scholar as of 2017-09-07)
Abstract: The NCBI has introduced two new web resources, LocusLink and RefSeq, that facilitate retrieval of gene-based information and provide reference sequence standards. These resources are designed to provide a non-redundant view of current knowledge about human genes, transcripts and proteins. Additional information about these resources is available on the LocusLink web site at http://www.ncbi.nlm.nih.gov/LocusLink/.