The Pfam protein families database in 2019. [PMID: 30357350]
Sara El-Gebali, Jaina Mistry, Alex Bateman, Sean R Eddy, Aurélien Luciani, Simon C Potter, Matloob Qureshi, Lorna J Richardson, Gustavo A Salazar, Alfredo Smart, Erik L L Sonnhammer, Layla Hirsh, Lisanna Paladin, Damiano Piovesan, Silvio C E Tosatto, Robert D Finn
The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors' ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.
Nucleic Acids Res 2019:47(D1)
12 Citations (from Europe PMC, 2019-07-27)
The Pfam protein families database: towards a more sustainable future. [PMID: 26673716]
Robert D Finn, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Jaina Mistry, Alex L Mitchell, Simon C Potter, Marco Punta, Matloob Qureshi, Amaia Sangrador-Vegas, Gustavo A Salazar, John Tate, Alex Bateman
In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteomes sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) which have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically-generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Nucleic Acids Res 2016:44(D1)
1531 Citations (from Europe PMC, 2019-09-03)
Pfam: the protein families database. [PMID: 24288371]
Robert D Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, Erik L L Sonnhammer, John Tate, Marco Punta
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.
Nucleic Acids Res 2014:42(Database issue)
2066 Citations (from Europe PMC, 2019-09-27)
The challenge of increasing Pfam coverage of the human proteome. [PMID: 23603847]
Jaina Mistry, Penny Coggill, Ruth Y Eberhardt, Antonio Deiana, Andrea Giansanti, Robert D Finn, Alex Bateman, Marco Punta
It is a worthy goal to completely characterize all human proteins in terms of their domains. Here, using the Pfam database, we asked how far we have progressed in this endeavour. Ninety per cent of proteins in the human proteome matched at least one of 5494 manually curated Pfam-A families. In contrast, human residue coverage by Pfam-A families was <45%, with 9418 automatically generated Pfam-B families adding a further 10%. Even after excluding predicted signal peptide regions and short regions (<50 consecutive residues) unlikely to harbour new families, for ~38% of the human protein residues, there was no information in Pfam about conservation and evolutionary relationship with other protein regions. This uncovered portion of the human proteome was found to be distributed over almost 25 000 distinct protein regions. Comparison with proteins in the UniProtKB database suggested that the human regions that exhibited similarity to thousands of other sequences were often either divergent elements or N- or C-terminal extensions of existing families. Thirty-four per cent of regions, on the other hand, matched fewer than 100 sequences in UniProtKB. Most of these did not appear to share any relationship with existing Pfam-A families, suggesting that thousands of new families would need to be generated to cover them. Also, these latter regions were particularly rich in amino acid compositional bias such as the one associated with intrinsic disorder. This could represent a significant obstacle toward their inclusion into new Pfam families. Based on these observations, a major focus for increasing Pfam coverage of the human proteome will be to improve the definition of existing families. New families will also be built, prioritizing those that have been experimentally functionally characterized. Database URL: http://pfam.sanger.ac.uk/
Database (Oxford) 2013:2013()
5 Citations (from Europe PMC, 2019-07-27)
The Pfam protein families database. [PMID: 22127870]
Marco Punta, Penny C Coggill, Ruth Y Eberhardt, Jaina Mistry, John Tate, Chris Boursnell, Ningze Pang, Kristoffer Forslund, Goran Ceric, Jody Clements, Andreas Heger, Liisa Holm, Erik L L Sonnhammer, Sean R Eddy, Alex Bateman, Robert D Finn
Pfam is a widely used database of protein families, currently containing more than 13,000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the 'sunburst' representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.
Nucleic Acids Res 2012:40(Database issue)
2008 Citations (from Europe PMC, 2019-09-03)
The Pfam protein families database. [PMID: 19920124]
Robert D Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E Pollington, O Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik L L Sonnhammer, Sean R Eddy, Alex Bateman
Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).
Nucleic Acids Res 2010:38(Database issue)
1804 Citations (from Europe PMC, 2019-09-03)
The Pfam protein families database. [PMID: 18039703]
Robert D Finn, John Tate, Jaina Mistry, Penny C Coggill, Stephen John Sammut, Hans-Rudolf Hotz, Goran Ceric, Kristoffer Forslund, Sean R Eddy, Erik L L Sonnhammer, Alex Bateman
Pfam is a comprehensive collection of protein domains and families, represented as multiple sequence alignments and as profile hidden Markov models. The current release of Pfam (22.0) contains 9318 protein families. Pfam is now based not only on the UniProtKB sequence database, but also on NCBI GenPept and on sequences from selected metagenomics projects. Pfam is available on the web from the consortium members using a new, consistent and improved website design in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/), as well as from mirror sites in France (http://pfam.jouy.inra.fr/) and South Korea (http://pfam.ccbb.re.kr/).
Nucleic Acids Res 2008:36(Database issue)
1265 Citations (from Europe PMC, 2019-09-03)
Pfam: clans, web tools and services. [PMID: 16381856]
Robert D Finn, Jaina Mistry, Benjamin Schuster-Böckler, Sam Griffiths-Jones, Volker Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard Durbin, Sean R Eddy, Erik L L Sonnhammer, Alex Bateman
Pfam is a database of protein families that currently contains 7973 entries (release 18.0). A recent development in Pfam has enabled the grouping of related families into clans. Pfam clans are described in detail, together with the new associated web pages. Improvements to the range of Pfam web tools and the first set of Pfam web services that allow programmatic access to the database and associated tools are also presented. Pfam is available on the web in the UK (http://www.sanger.ac.uk/Software/Pfam/), the USA (http://pfam.wustl.edu/), France (http://pfam.jouy.inra.fr/) and Sweden (http://pfam.cgb.ki.se/).
Nucleic Acids Res 2006:34(Database issue)
1248 Citations (from Europe PMC, 2019-09-03)
The Pfam protein families database. [PMID: 14681378]
Alex Bateman, Lachlan Coin, Richard Durbin, Robert D Finn, Volker Hollich, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik L L Sonnhammer, David J Studholme, Corin Yeats, Sean R Eddy
Pfam is a large collection of protein families and domains. Over the past 2 years the number of families in Pfam has doubled and now stands at 6190 (version 10.0). Methodology improvements for searching the Pfam collection locally as well as via the web are described. Other recent innovations include modelling of discontinuous domains allowing Pfam domain definitions to be closer to those found in structure databases. Pfam is available on the web in the UK (http://www.sanger.ac.uk/Software/Pfam/), the USA (http://pfam.wustl.edu/), France (http://pfam.jouy.inra.fr/) and Sweden (http://Pfam.cgb.ki.se/).
Nucleic Acids Res 2004:32(Database issue)
1846 Citations (from Europe PMC, 2019-09-03)
Enhanced protein domain discovery by using language modeling techniques from speech recognition. [PMID: 12668763]
Lachlan Coin, Alex Bateman, Richard Durbin
Most modern speech recognition uses probabilistic models to interpret a sequence of sounds. Hidden Markov models, in particular, are used to recognize words. The same techniques have been adapted to find domains in protein sequences of amino acids. To increase word accuracy in speech recognition, language models are used to capture the information that certain word combinations are more likely than others, thus improving detection based on context. However, to date, these context techniques have not been applied to protein domain discovery. Here we show that the application of statistical language modeling methods can significantly enhance domain recognition in protein sequences. As an example, we discover an unannotated Tf_Otx Pfam domain on the cone rod homeobox protein, which suggests a possible mechanism for how the V242M mutation on this protein causes cone-rod dystrophy.
Proc Natl Acad Sci U S A 2003:100(8)
28 Citations (from Europe PMC, 2019-09-03)
The Pfam protein families database. [PMID: 11752314]
Alex Bateman, Ewan Birney, Lorenzo Cerruti, Richard Durbin, Laurence Etwiller, Sean R Eddy, Sam Griffiths-Jones, Kevin L Howe, Mhairi Marshall, Erik L L Sonnhammer
Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the World Wide Web in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgb.ki.se/Pfam/, in France at http://pfam.jouy.inra.fr/ and in the US at http://pfam.wustl.edu/. The latest version (6.6) of Pfam contains 3071 families, which match 69% of proteins in SWISS-PROT 39 and TrEMBL 14. Structural data, where available, have been utilised to ensure that Pfam families correspond with structural domains, and to improve domain-based annotation. Predictions of non-domain regions are now also included. In addition to secondary structure, Pfam multiple sequence alignments now contain active site residue mark-up. New search tools, including taxonomy search and domain query, greatly add to the functionality and usability of the Pfam resource.
Nucleic Acids Res 2002:30(1)
1365 Citations (from Europe PMC, 2019-09-03)
The Pfam protein families database. [PMID: 10592242]
A Bateman, E Birney, R Durbin, S R Eddy, K L Howe, E L Sonnhammer
Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the WWW in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgr.ki.se/Pfam/ and in the US at http://pfam.wustl.edu/. The latest version (4.3) of Pfam contains 1815 families. These Pfam families match 63% of proteins in SWISS-PROT 37 and TrEMBL 9. For complete genomes Pfam currently matches up to half of the proteins. Genomic DNA can be directly searched against the Pfam library using the Wise2 package.
Nucleic Acids Res 2000:28(1)
758 Citations (from Europe PMC, 2019-09-03)
Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. [PMID: 9847196]
A Bateman, E Birney, R Durbin, S R Eddy, R D Finn, E L Sonnhammer
Pfam is a collection of multiple alignments and profile hidden Markov models of protein domain families. Release 3.1 is a major update of the Pfam database and contains 1313 families which are available on the World Wide Web in Europe at http://www.sanger.ac.uk/Software/Pfam/ and http://www.cgr.ki.se/Pfam/, and in the US at http://pfam.wustl.edu/. Over 54% of proteins in SWISS-PROT-35 and SP-TrEMBL-5 match a Pfam family. The primary changes of Pfam since release 2.1 are that we now use the more advanced version 2 of the HMMER software, which is more sensitive and provides expectation values for matches, and that it now includes proteins from both SP-TrEMBL and SWISS-PROT.
Nucleic Acids Res 1999:27(1)
299 Citations (from Europe PMC, 2019-09-03)
Pfam: multiple sequence alignments and HMM-profiles of protein domains. [PMID: 9399864]
E L Sonnhammer, S R Eddy, E Birney, A Bateman, R Durbin
Pfam contains multiple alignments and hidden Markov model based profiles (HMM-profiles) of complete protein domains. The definition of domain boundaries, family members and alignment is done semi-automatically based on expert knowledge, sequence similarity, other protein family databases and the ability of HMM-profiles to correctly identify and align the members. Release 2.0 of Pfam contains 527 manually verified families which are available for browsing and on-line searching via the World Wide Web in the UK at http://www.sanger.ac.uk/Pfam/ and in the US at http://genome.wustl.edu/Pfam/. Pfam 2.0 matches one or more domains in 50% of Swissprot-34 sequences, and 25% of a large sample of predicted proteins from the Caenorhabditis elegans genome.
Nucleic Acids Res 1998:26(1)
370 Citations (from Europe PMC, 2019-09-03)
Pfam: a comprehensive database of protein domain families based on seed alignments. [PMID: 9223186]
E L Sonnhammer, S R Eddy, R Durbin
Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new kazal, Fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences.
552 Citations (from Europe PMC, 2019-09-03)