ALFRED: The ALlele FREquency Database

1. Introduction

ALFRED, the ALlele FREquency Database, was developed to store and disseminate allele frequency data for a wide range of human polymorphisms. Data in ALFRED is contributed by a variety of public-domain sources, including, in some cases, the peer-reviewed scientific literature. All data in ALFRED is nevertheless reviewed and annotated by experienced human population geneticists to ensure that it is of the highest scientific quality. ALFRED is part of the Human Genome Epidemiology Network (HuGE) and is a component of the Epidemiology and Genomics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health. ALFRED has become an important resource for researchers working with allele frequency data, and it has also served as a reliable source of data for teaching evolution, anthropology, and other scientific topics. By offering allele frequency data for multiple polymorphisms from a variety of world populations, ALFRED supports research in fields such as human biology, evolution, history, and medicine.
The emphasis on data quality sets ALFRED apart from other online allele frequency resources. The ALFRED database was built by Dr. Kenneth K. Kidd and Jonathan L. Haines using Microsoft SQL Server, and most of the data was converted from an existing file-based system that contained data from the published scientific literature. Dr. Kidd established the ALFRED PIs and a working group of seven or eight human population geneticists from around the world, who continue to assess the quality of data contributed to ALFRED.
The curation process includes two rounds of review. The first round checks the quality of the data for a specific polymorphism. Data that does not pass the first round is sent back to the contributor with notes on what needs to be improved, and it may be resubmitted for review at a later time. Data that passes is uploaded to the database and labeled “provisional”; at this point it may be viewable to users but is not guaranteed to be accurate. To receive the final quality label, “evaluated”, data must pass a second review. Data at this level has been thoroughly checked by our curators and is treated as accurate. If the second round of review determines that the data does not meet the ALFRED standard, it is removed from the database. Any changes or notes for the data are recorded in the evaluation history, which is also viewable to users. This intensive approach to quality maintenance is intended to make ALFRED data an accurate representation of allelic diversity for a given polymorphism and population.
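
The two review rounds described above can be read as a small state machine. The sketch below is only an illustration of that reading: the class names, status names, and method names are hypothetical and are not taken from ALFRED's actual software.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status names; only "provisional" and "evaluated"
    # are labels mentioned in the text.
    SUBMITTED = "submitted"
    RETURNED = "returned to contributor"
    PROVISIONAL = "provisional"
    EVALUATED = "evaluated"
    REMOVED = "removed"


class Submission:
    """A contributed data set moving through the two review rounds."""

    def __init__(self, name):
        self.name = name
        self.status = Status.SUBMITTED
        self.history = []  # evaluation history: (round, outcome, note)

    def first_review(self, passed, note=""):
        # New and resubmitted data sets both enter at the first round.
        if self.status not in (Status.SUBMITTED, Status.RETURNED):
            raise ValueError(f"first review not applicable in state {self.status.name}")
        self.status = Status.PROVISIONAL if passed else Status.RETURNED
        self.history.append((1, self.status.value, note))

    def second_review(self, passed, note=""):
        # Only provisional data can be promoted to evaluated or removed.
        if self.status is not Status.PROVISIONAL:
            raise ValueError("second review requires provisional data")
        self.status = Status.EVALUATED if passed else Status.REMOVED
        self.history.append((2, self.status.value, note))


sub = Submission("hypothetical polymorphism X")
sub.first_review(passed=False, note="sample sizes missing")
sub.first_review(passed=True)  # resubmitted after corrections
sub.second_review(passed=True)
print(sub.status.value)
for entry in sub.history:
    print(entry)
```

Keeping the history as a list of review outcomes mirrors the evaluation history that the text says is viewable to users.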

2. Features of ALFRED

This is a topic under development.
ALFRED focuses on studies of human population genetics.
It fills a gap left by dbSNP by providing information on allele frequencies.
It contains data from a variety of new and existing anthropological, epidemiological, and forensic investigations, data that is currently missing from dbSNP.
ALFRED collects data from individuals regardless of ethnicity and is designed to remain informative even when a query returns only a few results from a single ethnic population. For this reason, all allele frequencies stored in ALFRED are linked to a defined ethnic population.
The comprehensive exploratory data analysis tools in ALFRED allow researchers to interactively perform simple and complex statistical analyses.
These features make ALFRED a more useful tool for medical researchers and geneticists than its predecessors, because it provides information more relevant to a gene than the mere presence of a reference SNP.

3. Data Collection and Compilation

The increasing availability and complexity of genome-wide frequency data poses a formidable challenge for data organization and presentation. Handling data at this scale will require a significant investment in expertise, and we anticipate that the SEQHRE project will greatly facilitate this. ALFRED has received requests from a number of researchers and projects to serve as a platform for their allele frequency data. These groups include dbGaP and the HapMap project, and we also anticipate inquiries from the 1000 Genomes Project, the various projects of the 2005 Human Genome Diversity and Admixture Mapping conferences, and future Genome-Wide Association Studies. Beneficial as these opportunities may be, each brings ALFRED the further task of understanding the data and finding the best means to present it. Our past colloquia have been quite helpful to data contributors and users, and regular meetings with contributors and consultants are likely to improve the development and utility of these resources over the next year and beyond.

The primary intention behind ALFRED was to provide access to the large body of knowledge on human allele frequencies by creating a resource that disseminates gene frequency data from a variety of loci and analytic strategies across different disciplines of biology and medicine. With this in mind, ALFRED’s users must be able to systematically locate, retrieve, and easily interpret data on the allele frequencies of interest. A large part of the effort at ALFRED therefore goes into developing the most effective approach to data presentation and extraction. The ALFRED Scientific Advisory Board sees this as an area of prime importance, because improving data retrieval mechanisms will enhance the site’s usability and encourage use of ALFRED as a primary research tool. To accomplish this objective, ALFRED must assess the state of dbGaP and HapMap and continue to work with project leaders and other data contributors to understand their data and how it can best be presented in the ALFRED framework. Each data set that is contributed or accessed from an external source should be accompanied by metadata that tells users the origin, context, and potential limitations of that data set. This will be a continuing point of emphasis for the coming year.
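
As a rough illustration of metadata covering origin, context, and potential limitations, a data set could carry a small record like the one below. This is a sketch under assumptions: the class name, field names, and example values are hypothetical and do not describe ALFRED's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    # All field names here are illustrative assumptions.
    source: str    # origin, e.g. the contributing project or publication
    context: str   # how and why the samples were collected
    limitations: list = field(default_factory=list)  # known caveats

    def summary(self):
        """Return a one-line description suitable for display beside the data."""
        caveats = "; ".join(self.limitations) or "none recorded"
        return f"{self.source} ({self.context}); limitations: {caveats}"


meta = DatasetMetadata(
    source="hypothetical external project",
    context="genome-wide SNP genotyping",
    limitations=["small sample size in some populations"],
)
print(meta.summary())
```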

ALFRED initially focused on data collected from published reports on gene frequencies at protein-coding loci in various populations. Because of the rich body of data on blood group allele frequencies, and as a service to the human genetics community, in 2005 ALFRED added data on these loci prepared by Dr. Marion Scott and her colleagues. More recently, DBCLS added data from the HapMap, Perlegen, and SeattleSNPs projects. Candidate gene data from dbGaP and other population genomics studies will be integrated in the near future. To provide a comprehensive view of allele frequency distributions across populations, it is important for ALFRED to offer access to data from the many current and future genome-wide association studies. Thus, following the recommendations of the ALFRED Scientific Advisory Board, it is important to review the requirements for data submission and to keep the scientific community aware of the value of depositing data in ALFRED and retrieving data from it.

4. Accessing and Utilizing ALFRED

This section describes how to access ALFRED and how to use the database to perform your own analyses. ALFRED’s website has been designed with the user in mind. To simplify browsing for particular allele frequencies or other data, ALFRED provides three separate search engines: one that lets the user browse a summarized version of the data and two that can be used to query the data.

The first search engine, once fully implemented, will be the most basic and easy-to-use search feature. It is aimed at users who may not be familiar with allele frequencies and are just browsing the website. A user will be able to type, say, a gene name and get back a list of all alleles associated with that gene. This list will contain only summarized data, usually in the form of a table.

The second search engine provides a more detailed search. It allows the user to search for a particular allele frequency or other data by one or more alleles, race, region, or any other available field. The user selects fields from drop-down boxes and enters the values to search by, and the data is returned in table format. This may be useful, for example, to a researcher who wants to see all data for a certain allele or combination of alleles. It may also be useful for entering data from the textbook problems.

The third search engine filters the data and displays only the records that meet specified criteria. As with the second engine, the user selects a field and an operator (e.g., Age > 20) and then enters a value in a form control suited to the field (for an age, a number input box). A record is all the data in one row of the database, so this tool is useful when a table of records is initially too long and needs to be narrowed down. While this search engine is very powerful, it is aimed at users who are more computer savvy and more familiar with relational databases, because it requires an understanding of fields, records, and tables. It applies an SQL-type language to the form described above. Although the first two search engines are still in development, the data is already free to browse at any time.

The next step toward a greater understanding of the allele frequency data in ALFRED is learning how to perform your own analyses. This can be done with a program such as Microsoft Excel, and the results can be presented as a table or graph. To simplify this process, we have bolded certain titles and separated the different tables of data. This may help the many users who are students using the data to complete research assignments or problems. Because the data is similar to that given in textbooks, students may want to try example problems that are not based on ALFRED data but use the same race or region. For this reason, we have included a data sheet of textbook problems. We hope that you make the most of the data available, and we appreciate all feedback that helps us further develop the website.
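
The field, operator, and value filtering described for the third search engine can be illustrated with a short sketch. It assumes records are plain dictionaries, one per database row; the function name, field names, and sample values are hypothetical and do not reflect ALFRED's real interface or data.

```python
import operator

# Map the operator symbols a user might pick in a form to comparison functions.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def filter_records(records, field, op, value):
    """Keep only the records whose `field` satisfies `op value`."""
    compare = OPERATORS[op]
    # Records lacking the field are skipped rather than raising an error.
    return [r for r in records if field in r and compare(r[field], value)]


# Made-up example rows, not real ALFRED data.
records = [
    {"population": "A", "allele": "T", "frequency": 0.12, "sample_size": 50},
    {"population": "B", "allele": "T", "frequency": 0.47, "sample_size": 112},
    {"population": "C", "allele": "T", "frequency": 0.31, "sample_size": 18},
]

large = filter_records(records, "sample_size", ">", 20)
print([r["population"] for r in large])  # ['A', 'B']
```

Chaining several calls narrows a long table step by step, which is the use case the text gives for this kind of search.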

5. Future Developments

Combining data from various databases.

It is clear that much synergy exists between the different microsatellite, minisatellite, SNP, and gene frequency databases. ALFRED hopes that continued collaboration between these resources will allow their data to be correlated. An open-source policy on all data held in these databases would also enable open access to the data and ready comparisons between the different resources. ALFRED has aligned itself with the goals of the Human Genome Variation Society.

Incorporate data on linkage disequilibrium and LD mapping.

As increasing numbers of allelic variants and polymorphisms are discovered, it is important to know which marker changes indicate variation in the actual gene or locus in question. Currently, there are relatively few good tools for examining this. Knowledge of the pattern of linkage disequilibrium can provide measures that limit the search region for disease-causing mutations. ALFRED plans to expand its database of allele frequency data to include haplotype frequency data for different world populations. This will facilitate the mapping of disease-causing mutations by indicating which patterns of markers should be used in different populations at different sample sizes.
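
To make the link between haplotype frequencies and linkage disequilibrium concrete, the sketch below computes the standard pairwise measures D, D' and r² for two biallelic loci. It is a minimal illustration, not a description of any planned ALFRED feature: the function name and the example frequencies are made up.

```python
def ld_measures(hap):
    """Compute D, signed D' and r^2 from two-locus haplotype frequencies.

    `hap` maps the four haplotypes 'AB', 'Ab', 'aB', 'ab' to frequencies
    that sum to 1.
    """
    total = sum(hap.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_a = hap["AB"] + hap["Ab"]  # frequency of allele A at the first locus
    p_b = hap["AB"] + hap["aB"]  # frequency of allele B at the second locus

    # D: observed AB frequency minus the frequency expected under independence.
    d = hap["AB"] - p_a * p_b

    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Made-up haplotype frequencies for one hypothetical population.
example = {"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3}
d, d_prime, r2 = ld_measures(example)
print(f"D={d:.3f}  D'={d_prime:.3f}  r2={r2:.3f}")
```

Because haplotype frequencies differ between populations, the same pair of markers can give quite different values in each population, which is why population-specific haplotype data matters for choosing markers.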
