Posts Tagged open data

Panton Principles for Open Data

Launch of the Panton Principles for Open Data in Science and ‘Is It Open Data?’ Web Service, Jonathan Gray, Open Knowledge Foundation Blog, February 19, 2010 [Connotea bookmark][Topsy search results][Panton Principles][Is It Open Data?].

A commentary: Panton Principles for Open Data in Science, Bill Hooker, Open Reading Frame, February 19, 2010 [FriendFeed entry].

The Principles:

  • When publishing data make an explicit and robust statement of your wishes [with respect to re-use and re-purposing of individual data elements, the whole data collection, and subsets of the collection].
  • Use a recognized waiver or license that is appropriate for data [many widely recognized licenses are not appropriate].
  • If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.
  • Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

Some background information about the origin of these Principles (at the Panton Arms, Cambridge): The Panton Principles: A breakthrough on data licensing for public science? Peter Murray-Rust, petermr’s blog, May 16th, 2009, and: A breakthrough on data licensing for public science? Cameron Neylon, Science in the Open, May 15, 2009.



Guidance re data sharing and patient privacy

New guidance on data sharing will minimize risks to patient privacy, EurekAlert, January 28, 2010.

And: BMJ policy on data sharing, Trish Groves, BMJ 2010(Jan 28); 340: c564 (Editorial; only the first 150 words are publicly accessible).

See also: How to publish raw clinical data: guidelines from Trials and the BMJ, Matthew Cockerill, BioMed Central Blog, January 29, 2010.

About this article: Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers by Iain Hrynaszkiewicz, Melissa L Norton, Andrew J Vickers, Douglas G Altman, Trials 2010(Jan 29); 11(1): 9 [Epub ahead of print][Connotea bookmark][PubMed Citation].

This article has been co-published: BMJ 2010(Jan 28); 340: c181 [PubMed Citation]. Summary points:

  • Despite journal and funder policies requiring data sharing, there has been little practical guidance on how data should be shared
  • Confidentiality and anonymity are key considerations when publishing or sharing data relating to individuals, and this article provides practical advice on data sharing while minimising risks to patient privacy
  • Consent for publication of appropriately anonymised raw data should ideally be sought from participants in clinical research
  • Direct identifiers such as patients’ names should be removed from datasets; datasets that contain three or more indirect identifiers, such as age or sex, should be reviewed by an independent researcher or ethics committee before being submitted for publication
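The last summary point describes a simple screening rule: remove direct identifiers, and send a dataset for independent review if it holds three or more indirect identifiers. A minimal sketch of such a check follows. The identifier lists and the function name `screen_columns` are hypothetical placeholders; the published guidance defines its own lists, which are not reproduced here.

```python
# Illustrative lists only (assumptions); the published guidance has its own.
DIRECT_IDENTIFIERS = {"name", "initials", "address", "telephone", "email"}
INDIRECT_IDENTIFIERS = {"age", "sex", "occupation", "ethnicity", "place_of_treatment"}

REVIEW_THRESHOLD = 3  # "three or more indirect identifiers"


def screen_columns(columns):
    """Classify dataset column names and flag whether review is needed."""
    normalised = {c.strip().lower() for c in columns}
    direct = sorted(normalised & DIRECT_IDENTIFIERS)
    indirect = sorted(normalised & INDIRECT_IDENTIFIERS)
    return {
        "remove": direct,            # columns to drop before publication
        "indirect": indirect,        # columns counted toward the threshold
        "needs_review": len(indirect) >= REVIEW_THRESHOLD,
    }


if __name__ == "__main__":
    print(screen_columns(["Name", "age", "sex", "occupation", "outcome"]))
```

A check like this only looks at column names; it cannot tell whether free-text fields or rare values would identify someone, which is why the guidance calls for human review.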


Making the best of research data

Sharing knowledge: a new frontier for public-private partnerships in medicine, Charles Auffray, Genome Med 2009(Mar 4); 1(3): 29. [OA to full text (free registration required)]. Abstract:

To help overcome the bottlenecks that limit the development of diagnostic and therapeutic products, academic and industrial researchers, patient organizations and charities, and regulatory and funding institutions should redefine the basis for sharing the knowledge collected in large-scale clinical and experimental studies.

Found via: Sharing of knowledge – Making the best of research data, Iratxe Puebla, BioMed Central Blog, March 12, 2009.


Open access genomes

Open access genomes! (but how is OA protected?), Dave Love, dave love’s blog, October 21, 2008. Excerpt:

We have come a long way in the last 15 years since Craig Venter and his company, Celera, refused to deposit their human genomic sequence in NCBI/GenBank and others who practice gene patenting deflated our collective tyres. I think that PGP understands the benefits of being OA, but I didn’t see anything on their website about a legal backbone to protect that access, such as a Creative Commons copyrights. I hope they will get some advice on this from librarians, lawyers, publishers, and others in the OA community!

For more about the PGP, see: More on the Personal Genome Project, Gavin Baker, Open Access News, October 20, 2008; Profile of the Personal Genome Project, Gavin Baker, Open Access News, August 06, 2008; The Personal Genome Project, George M Church, Mol Syst Biol 2005; 1: 2005.0030 [Epub 2005(Dec 13)].

Modified on October 21, 2008:

See also: From genetic privacy to open consent, Jeantine E Lunshof, Ruth Chadwick, Daniel B Vorhaus and George M Church, Nat Rev Genet 2008(May); 9(5): 406-11. Excerpts from the full text (not OA):

Box 3 | Key features of the Personal Genome Project’s open-consent policy

Open consent as part of the Personal Genome Project implies that research participants accept that:

* Their data could be included in an open-access public database.
* No guarantees are given regarding anonymity, privacy and confidentiality.
* Participation involves a certain risk of harm to themselves and their relatives.
* Participation does not benefit the participants in any tangible way.
* Compliance with monitoring of their well-being through quarterly questionnaires is required.
* Withdrawal from the study is possible at any time.
* Complete removal of data that have been available in the public domain may not be possible.

The moral goal of open consent is to obtain valid consent by effectuating veracity as a precondition for valid consent and effectuating voluntariness through strict eligibility criteria, as a precondition for substantial informed consent.

[End of Box 3]

Open consent. Open consent means that volunteers consent to unrestricted re-disclosure of data originating from a confidential relationship, namely their health records, and to unrestricted disclosure of information that emerges from any future research on their genotype–phenotype data set, the information content of which cannot be predicted. No promises of anonymity, privacy or confidentiality are made. The leading moral principle is veracity — telling the truth — which should precede autonomy. Although, in clinical medicine, veracity is the legal norm in many jurisdictions, physicians may try to justify the withholding of information by invoking the ‘therapeutic privilege’. In research, there is no such privilege, and when seeking informed consent from research subjects, distorted or incomplete information could undermine trust in researchers and in science.

Comment: Those contemplating OA to genetic data need to pay careful attention to the concept of “open consent”, and its emphasis on “telling the truth” and on “voluntariness”. It’s also noted in the full text that “in the PGP potential volunteers are strongly advised to discuss their participation with relatives”.


Open data for genomics research

For those interested in open data, there’s a noteworthy recent press release about an addition to the NIH Database of Genotypes and Phenotypes (dbGaP; see also posts to Open Access News about dbGaP):

Genome-Wide Association Study on Parkinson’s Disease Finds Public Home at NIH, Geoff Spencer, NIH News, March 4, 2008. Excerpts:

The study, conducted by researchers at Mayo Clinic in Rochester, Minn., in collaboration with scientists at Perlegen Sciences, Inc., in Mountain View, Calif., was the first genome-wide association study applied to Parkinson’s disease. It was funded under MJFF’s Linked Efforts to Accelerate Parkinson’s Solutions (LEAPS) initiative.

The raw data from a GWAS [genome-wide association] study is frequently useful to other researchers who may combine it with their own data to improve the analytical power and even make new discoveries. But such information may not be deposited in unregulated public databases because the data typically contain details that could be used to identify study volunteers, potentially violating their confidentiality. In order to protect the volunteers’ confidentiality, NIH requires the data submitters to remove identifying information (names, social security numbers, etc.). In addition, researchers who want to use the data must ask for permission and agree to other data use limitations, such as safeguarding participants’ privacy and using the data in ways consistent with consent agreements signed by study subjects. The researcher requests are then reviewed by a data access committee or DAC. Data access committees have been established at several NIH institutes that organize and support GWAS. Because this project was primarily supported by a private foundation, it lacked a DAC to review access requests, so it was considered an orphan data set.

NHGRI’s data access committee recently agreed to adopt the study and manage the data access approval process so that the data could be made widely available while ensuring appropriate protections.

For researchers who want to view the Mayo-Perlegen LEAPS Collaboration data, dbGaP offers two levels of access. The first is open access, where certain data are available without restriction, and the second is controlled access, which requires authorization. The open-access section allows users to view study documents that do not risk identifying individual participants, such as protocols and summaries of genotype and phenotype data. The controlled-access portion of the database allows approved researchers to download individual-level genotype and phenotype data from which the study participants’ personal identifiers, such as names, have been removed.

Although personally identifying information is not included in the database, concern remains that it may someday be possible to identify someone based on their genetic profile. For this reason only researchers agreeing not to attempt to identify individuals in the database will be given access to the data, as outlined in NIH’s Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS) available at

An excerpt from section IV (Publication) of NIH’s policy:

The NIH expects that investigators who contribute data to the NIH GWAS data repository will retain the exclusive right to publish analyses of the dataset for a defined period of time following the release of a given genotype-phenotype dataset through the NIH GWAS data repository (including the pre-computed analyses of the data). During this period of exclusivity, the NIH will grant access through the DACs to other investigators, who may analyze the data, but are expected not to submit their analyses or conclusions for publication during the exclusivity period. The maximum period of exclusivity is twelve months from the date that the GWAS dataset is made available for access through the NIH GWAS data repository …
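The exclusivity rule in this excerpt is simple date arithmetic: at most twelve months from the date the dataset becomes available through the repository. A minimal sketch follows, using only the Python standard library. The function names are hypothetical, and treating the anniversary date itself as the first day on which other investigators may submit is an assumption, not something the policy excerpt states.

```python
import calendar
from datetime import date

MAX_EXCLUSIVITY_MONTHS = 12  # maximum stated in the policy excerpt


def add_months(d, months):
    """Return d shifted by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def exclusivity_end(release_date, months=MAX_EXCLUSIVITY_MONTHS):
    """Date on which the contributors' exclusive publication period ends."""
    if not 0 <= months <= MAX_EXCLUSIVITY_MONTHS:
        raise ValueError("exclusivity period must be 0 to 12 months")
    return add_months(release_date, months)


def may_submit(release_date, today, months=MAX_EXCLUSIVITY_MONTHS):
    """Assumption: submission is allowed from the end date onward."""
    return today >= exclusivity_end(release_date, months)


if __name__ == "__main__":
    print(exclusivity_end(date(2008, 3, 4)))
```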

There’s a burgeoning literature on the ethical issues raised by human genomics research. This literature is yet another example of information that should be as openly accessible as possible, but only some of it is. A prominent example of an article that isn’t OA:

ETHICS: Identifiability in Genomic Research, William W. Lowrance and Francis S. Collins, Science 2007(Aug 3); 317(5838): 600-2 (subscription required). Excerpts:

Open versus controlled release. A cultural habit of rapid, open release of genomic data has been pursued by the involved scientists and institutions since the beginning of the Human Genome Project (19-20). There is no question about the research advantages of such principles and policies. But almost certainly, the principles will have to be modified now for databases that include extensive genotypic information, to heighten the protection of identifiability (21).

Open data release, as with deposition in a publicly accessible Web site, is acceptable only if either: (i) the data are for all practical purposes not identifiable; or (ii) consent to the release is ethically legitimate and is granted by the data subjects, or the necessity for consent is waived by a competent ethics body. Most projects now take three precautionary steps: sequestering the standard identifiers via key-coding; performing disclosure risk-reduction (such as by rounding birth date to year of birth); and providing access to the de-identified data under conditional terms.
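Two of the precautionary steps named in this excerpt, key-coding and rounding birth date to year of birth, can be illustrated with a short sketch. The field names (`patient_id`, `birth_date`, `study_code`) and both function names are hypothetical, and a real project would also have to store the key table securely and separately from the released data.

```python
import secrets
from datetime import date


def key_code(records, id_field="patient_id"):
    """Replace the standard identifier with a random study code.

    Returns the coded records plus the key table that links original
    identifiers to codes; the key table must be kept sequestered.
    """
    key_table = {}
    coded = []
    for rec in records:
        original_id = rec[id_field]
        if original_id not in key_table:
            key_table[original_id] = secrets.token_hex(8)
        out = {k: v for k, v in rec.items() if k != id_field}
        out["study_code"] = key_table[original_id]
        coded.append(out)
    return coded, key_table


def round_birth_date(record, dob_field="birth_date"):
    """Disclosure risk reduction: keep only the year of birth."""
    out = dict(record)
    dob = out.pop(dob_field)
    out["birth_year"] = dob.year
    return out


if __name__ == "__main__":
    rows = [{"patient_id": "A1", "birth_date": date(1950, 3, 4), "genotype": "AG"}]
    coded, _key = key_code(rows)
    print(round_birth_date(coded[0]))
```

The third step in the excerpt, access under conditional terms, is a matter of governance and is not something code alone can enforce.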

In contrast, references 19-21 (see excerpt above) are freely accessible.

Reference 19: National Human Genome Research Institute, Reaffirmation and extension of NHGRI rapid data release policies, National Human Genome Research Institute, February 2003.

Reference 20: Wellcome Trust, Policy on data management and sharing, Wellcome Trust, London, January 2007.

Reference 21: W. W. Lowrance, Access to Collections of Data and Materials for Health Research: A Report to the Medical Research Council and The Wellcome Trust, Wellcome Trust, London, 2006,

An excerpt from reference 21 (the first paragraph of the Executive summary):

Access to collections can be improved, and most scientists hope it will be. But if access is to be optimised, not only will barriers have to be reduced but the provision of access will have to be actively facilitated, guided, funded and rewarded.

An e-letter response to the article by Lowrance and Collins is also freely accessible. It’s from John Gallacher: Identifiability in Genomic Research (Science, 13 November 2007). Excerpt:

The article [by Lowrance and Collins] is written from within the cultural habit of the human genome project in which the rapid and open release of data has proven very advantageous and in which individual identities are easy to conceal and of no social or scientific importance. However, the world that the article explores is epidemiologic, where research cycles are longer and, very shortly, the data will ideally comprise complete genomes on large numbers of individuals linked to detailed clinical information. In the epidemiologic world, public confidence is everything. No confidence means no participation and no epidemiology.

Another example of a relevant article that’s freely accessible:

Data sharing and intellectual property in a genomic epidemiology network: policies for large-scale research collaboration, Dave A. Chokshi, Michael Parker, Dominic P. Kwiatkowski, Bull World Health Organ, 2006(May); 84(5): 382-7. Excerpts:

We propose two fundamental principles upon which to base policy decisions about data sharing and intellectual property: (1) impediments to innovation in research processes should be minimized, and (2) the fruits of research — eventual products that result from scientific discoveries — should be made as widely accessible as possible, particularly to the people who need them the most.

Access to information outside the consortium

It is beyond the scope of this paper to deal with the general issue of protection of anonymity for research subjects, except to say that this is of paramount importance. A specific issue for genomic epidemiology is that genetic data may, in certain circumstances, indirectly identify individuals within a well-defined study population. Thus researchers and ethical committees need to weigh up the risks and benefits of different levels of personal genetic identification. For example, there is a difference in risk between releasing large amounts of genetic data for each individual within a small village that is identified and releasing the same data for subjects sampled randomly from a large population, even if both groups are fully anonymised. One way of reducing any potential risk to individuals is to publicly release only pooled data.
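The last suggestion in this excerpt, releasing only pooled data, amounts to publishing aggregate counts rather than per-individual rows. A minimal sketch follows. The record layout, the function name `pool_genotypes`, and the minimum group size are all assumptions made for illustration; the paper does not specify a threshold.

```python
from collections import Counter


def pool_genotypes(records, min_group_size=5):
    """Return genotype counts for a group, with no per-individual rows.

    min_group_size is a hypothetical safeguard: very small groups are
    refused because their pooled counts could still point to individuals.
    """
    if len(records) < min_group_size:
        raise ValueError("group too small to release pooled data")
    return dict(Counter(rec["genotype"] for rec in records))


if __name__ == "__main__":
    sample = [{"genotype": g} for g in ["AA", "AG", "AG", "GG", "AG"]]
    print(pool_genotypes(sample))
```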

Four conclusions seem obvious:

1) It’s difficult to construct any convincing ethical justifications for limiting access to peer-reviewed publications in the field of human genomics research.

2) There are both ethical and pragmatic justifications for some limitations on access to human genomics data.

3) The issues, both ethical and pragmatic, raised by open sharing of human genetic epidemiology data are substantially more complicated than those raised by open access to peer-reviewed publications.

4) The field of “open data” seems likely to be one of increasing significance for the open access movement.
