Open data for genomics research

For those interested in open data, there’s a noteworthy recent press release about an addition to the NIH Database of Genotypes and Phenotypes (dbGaP; see also posts to Open Access News about dbGaP):

Genome-Wide Association Study on Parkinson’s Disease Finds Public Home at NIH, Geoff Spencer, NIH News, March 4, 2008. Excerpts:

The study, conducted by researchers at Mayo Clinic in Rochester, Minn., in collaboration with scientists at Perlegen Sciences, Inc., in Mountain View, Calif., was the first genome-wide association study applied to Parkinson’s disease. It was funded under MJFF’s [the Michael J. Fox Foundation’s] Linked Efforts to Accelerate Parkinson’s Solutions (LEAPS) initiative.

The raw data from a GWAS [genome-wide association] study is frequently useful to other researchers who may combine it with their own data to improve the analytical power and even make new discoveries. But such information may not be deposited in unregulated public databases because the data typically contain details that could be used to identify study volunteers, potentially violating their confidentiality. In order to protect the volunteers’ confidentiality, NIH requires the data submitters to remove identifying information (names, social security numbers, etc.). In addition, researchers who want to use the data must ask for permission and agree to other data use limitations, such as safeguarding participants’ privacy and using the data in ways consistent with consent agreements signed by study subjects. The researcher requests are then reviewed by a data access committee or DAC. Data access committees have been established at several NIH institutes that organize and support GWAS. Because this project was primarily supported by a private foundation, it lacked a DAC to review access requests, so it was considered an orphan data set.

NHGRI’s data access committee recently agreed to adopt the study and manage the data access approval process so that the data could be made widely available while ensuring appropriate protections.

For researchers who want to view the Mayo-Perlegen LEAPS Collaboration data, dbGaP offers two levels of access. The first is open access, where certain data are available without restriction, and the second is controlled access, which requires authorization. The open-access section allows users to view study documents that do not risk identifying individual participants, such as protocols and summaries of genotype and phenotype data. The controlled-access portion of the database allows approved researchers to download individual-level genotype and phenotype data from which the study participants’ personal identifiers, such as names, have been removed.

Although personally identifying information is not included in the database, concern remains that it may someday be possible to identify someone based on their genetic profile. For this reason only researchers agreeing not to attempt to identify individuals in the database will be given access to the data, as outlined in NIH’s Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS) available at

An excerpt from section IV (Publication) of NIH’s policy:

The NIH expects that investigators who contribute data to the NIH GWAS data repository will retain the exclusive right to publish analyses of the dataset for a defined period of time following the release of a given genotype-phenotype dataset through the NIH GWAS data repository (including the pre-computed analyses of the data). During this period of exclusivity, the NIH will grant access through the DACs to other investigators, who may analyze the data, but are expected not to submit their analyses or conclusions for publication during the exclusivity period. The maximum period of exclusivity is twelve months from the date that the GWAS dataset is made available for access through the NIH GWAS data repository …

There’s a burgeoning literature on the ethical issues raised by human genomics research. This literature is yet another example of information that should be as openly accessible as possible, but only some of it is. A prominent example of an article that isn’t OA:

ETHICS: Identifiability in Genomic Research, William W. Lowrance and Francis S. Collins, Science 2007(Aug 3); 317(5838): 600-2 (subscription required). Excerpts:

Open versus controlled release. A cultural habit of rapid, open release of genomic data has been pursued by the involved scientists and institutions since the beginning of the Human Genome Project (19-20). There is no question about the research advantages of such principles and policies. But almost certainly, the principles will have to be modified now for databases that include extensive genotypic information, to heighten the protection of identifiability (21).

Open data release, as with deposition in a publicly accessible Web site, is acceptable only if either: (i) the data are for all practical purposes not identifiable; or (ii) consent to the release is ethically legitimate and is granted by the data subjects, or the necessity for consent is waived by a competent ethics body. Most projects now take three precautionary steps: sequestering the standard identifiers via key-coding; performing disclosure risk-reduction (such as by rounding birth date to year of birth); and providing access to the de-identified data under conditional terms.
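The first two precautionary steps named in the excerpt above (key-coding the standard identifiers and rounding birth date to year of birth) can be illustrated with a minimal Python sketch. The record layout and the function and field names (`key_code`, `round_birth_date`, `name`, `birth_date`, `study_code`) are hypothetical, chosen only for illustration; they are not taken from dbGaP or from the article. The third step, conditional access, is a matter of policy and is not modelled here.

```python
import secrets
from datetime import date


def key_code(records, id_field="name"):
    """Replace a direct identifier with a random study code.

    Returns (coded_records, key_table). The key table maps each code back
    to the original identifier and would be stored separately from the
    released data (hypothetical layout, for illustration only).
    """
    key_table = {}
    coded = []
    for rec in records:
        code = secrets.token_hex(8)
        while code in key_table:  # regenerate on the (unlikely) collision
            code = secrets.token_hex(8)
        key_table[code] = rec[id_field]
        new_rec = {k: v for k, v in rec.items() if k != id_field}
        new_rec["study_code"] = code
        coded.append(new_rec)
    return coded, key_table


def round_birth_date(record, field="birth_date"):
    """Disclosure risk-reduction: keep only the year of birth."""
    new_rec = dict(record)
    new_rec["birth_year"] = new_rec.pop(field).year
    return new_rec


# Made-up example record; not real study data.
participants = [
    {"name": "Alice Example", "birth_date": date(1950, 6, 15), "genotype": "AG"},
]
coded, key_table = key_code(participants)
released = [round_birth_date(r) for r in coded]
```

The released records carry only a study code, a birth year and the genotype; re-linking them to a person requires the separately held key table.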

In contrast, references 19-21 (see excerpt above) are freely accessible.

Reference 19: National Human Genome Research Institute, Reaffirmation and extension of NHGRI rapid data release policies, National Human Genome Research Institute, February 2003.

Reference 20: Wellcome Trust, Policy on data management and sharing, Wellcome Trust, London, January 2007.

Reference 21: W. W. Lowrance, Access to Collections of Data and Materials for Health Research: A Report to the Medical Research Council and The Wellcome Trust, Wellcome Trust, London, 2006.

An excerpt from reference 21 (the first paragraph of the Executive summary):

Access to collections can be improved, and most scientists hope it will be. But if access is to be optimised, not only will barriers have to be reduced but the provision of access will have to be actively facilitated, guided, funded and rewarded.

An e-letter response to the article by Lowrance and Collins is also freely accessible. It’s from John Gallacher: Identifiability in Genomic Research (Science, 13 November 2007). Excerpt:

The article [by Lowrance and Collins] is written from within the cultural habit of the human genome project in which the rapid and open release of data has proven very advantageous and in which individual identities are easy to conceal and of no social or scientific importance. However, the world that the article explores is epidemiologic, where research cycles are longer and, very shortly, the data will ideally comprise complete genomes on large numbers of individuals linked to detailed clinical information. In the epidemiologic world, public confidence is everything. No confidence means no participation and no epidemiology.

Another example of a relevant article that’s freely accessible:

Data sharing and intellectual property in a genomic epidemiology network: policies for large-scale research collaboration, Dave A. Chokshi, Michael Parker, Dominic P. Kwiatkowski, Bull World Health Organ, 2006(May); 84(5): 382-7. Excerpts:

We propose two fundamental principles upon which to base policy decisions about data sharing and intellectual property: (1) impediments to innovation in research processes should be minimized, and (2) the fruits of research — eventual products that result from scientific discoveries — should be made as widely accessible as possible, particularly to the people who need them the most.

Access to information outside the consortium

It is beyond the scope of this paper to deal with the general issue of protection of anonymity for research subjects, except to say that this is of paramount importance. A specific issue for genomic epidemiology is that genetic data may, in certain circumstances, indirectly identify individuals within a well-defined study population. Thus researchers and ethical committees need to weigh up the risks and benefits of different levels of personal genetic identification. For example, there is a difference in risk between releasing large amounts of genetic data for each individual within a small village that is identified and releasing the same data for subjects sampled randomly from a large population, even if both groups are fully anonymised. One way of reducing any potential risk to individuals is to publicly release only pooled data.
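The last suggestion in the excerpt above, publicly releasing only pooled data, can be sketched as follows: per-individual genotypes stay private and only group-level allele counts or frequencies are released. This is a minimal illustration under stated assumptions. The genotype encoding (two-letter strings such as "AG"), the function names, and the `min_group_size` threshold are all hypothetical and do not come from the paper. Pooling is described there as reducing risk, not eliminating it.

```python
from collections import Counter


def pooled_allele_counts(genotypes):
    """Sum allele counts over a group of per-individual genotypes.

    Each genotype is assumed to be a string of allele letters, e.g. "AG".
    """
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)  # counts each allele letter
    return dict(counts)


def pooled_allele_frequencies(genotypes, min_group_size=20):
    """Return group-level allele frequencies, refusing very small groups.

    min_group_size is an illustrative assumption: pooling over a handful
    of people would do little to hide any one of them.
    """
    genotypes = list(genotypes)
    if len(genotypes) < min_group_size:
        raise ValueError("group too small to release pooled data")
    counts = pooled_allele_counts(genotypes)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Made-up genotypes at a single marker; not real study data.
sample = ["AG", "AA", "GG", "AG"]
summary = pooled_allele_frequencies(sample, min_group_size=4)
```

Only `summary` would be published; the `sample` list itself stays behind whatever access controls apply to individual-level data.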

Four conclusions seem obvious:

1) It’s difficult to construct any convincing ethical justifications for limiting access to peer-reviewed publications in the field of human genomics research.

2) There are both ethical and pragmatic justifications for some limitations on access to human genomics data.

3) The issues, both ethical and pragmatic, raised by open sharing of human genetic epidemiology data are substantially more complicated than those raised by open access to peer-reviewed publications.

4) The field of “open data” seems likely to be one of increasing significance for the open access movement.



  1. tillje said

    See also: Research Ethics Recommendations for Whole-Genome Research: Consensus Statement, by Caulfield T, McGuire AL, Cho M, Buchanan JA, Burgess MM, et al., PLoS Biol 2008(25 Mar); 6(3): e73. An excerpt from the section on “Public Data Release”:

    International policies call for the rapid public release of all sequence data [30–32]. The benefit of public data access is that it provides significant scientific utility by enabling immediate international research use of the data. However, policies that advocate unrestricted data sharing have been challenged because of the privacy risks associated with public access to genomic information [33,34]. Whole-genome research increases these privacy concerns, particularly the uncertainties surrounding the implications of the data. As a result, a cautious approach seems warranted. Investigators and research ethics boards need to carefully consider whether public data release is warranted.

    References 30 to 34 in the above excerpt are:

    30. The Wellcome Trust (2003) Sharing data from large-scale biological research projects: A system of tripartite responsibility. Accessed 14 February 2008.
    31. Genome Canada (2005) Data release and resource sharing policy. Accessed 14 February 2008.
    32. National Human Genome Research Institute (2003) Reaffirmation and extension of NHGRI rapid data release policies: Large-scale sequencing and other community resource projects. Accessed 14 February 2008.
    33. McGuire AL, Gibbs RA (2006) No longer de-identified. Science 312: 370-371 doi:10.1126/science.1125339.
    34. Lowrance WW, Collins FS (2007) Identifiability in genomic research. Science 317: 600-602 doi:10.1126/science.1147699.
