Mining the biological literature about a model organism

Excerpts from Wikipedia about the model organism Caenorhabditis elegans:

Caenorhabditis elegans … is a free-living nematode (roundworm), about 1 mm in length, which lives in temperate soil environments. Research into the molecular and developmental biology of C. elegans was begun in 1974 by Sydney Brenner [1] and it has since been used extensively as a model organism.

As for most model organisms, there is a dedicated online database for the species that is actively curated by scientists working in this field. The WormBase database attempts to collate all published information on C. elegans and other related nematodes.

There’s a “Find” search box at the top of the WormBase home page. Entering the phrase “open access” in the search box and selecting the “Literature Search” option in the drop-down menu yielded matches in 214 of the documents in the Literature database. The “Database Description” at the bottom of each page of results stated that the “Current database contains 9475 full text papers and 24612 abstracts”.

The first 25 papers identified in the results of this search were all openly accessible via an “online text” link to the website of the journal in which the document was published. However, it appears that only a small minority of the 9475 full text papers were identified as “open access” papers. (Probably fewer than 214/9475 ≈ 2.3%, because some of the 214 “open access” documents appear to have been listed more than once.)

Another “Literature Search”, using “stem cell” as the keyword phrase, yielded matches in 867 documents. An “online text” link was present for 19 of the first 25 of these documents. This link provided free access to the full text for 9 of these 19 papers. (Unfortunately, whether or not the “online text” is freely accessible can only be determined by attempting to obtain access.)

Although this latter finding suggests that a much higher proportion of the documents in this database may be freely accessible than are found by searching for the key words “open access”, a much more extensive study would need to be carried out in order to obtain a reliable estimate of the proportion of freely-accessible documents in this particular database.
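To give a rough sense of why a sample of 25 documents is too small for a reliable estimate, here is a minimal Python sketch that recomputes the two proportions mentioned above and attaches a 95% Wilson score interval to the second one. The function name wilson_interval is only illustrative, and the interval assumes the 25 documents behave like a random sample from the database, which the first page of search results almost certainly is not.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the Wilson score interval (low, high) for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Upper bound on the share of full text papers matching "open access"
# (214 matches out of 9475 full text papers, as reported above).
open_access_share = 214 / 9475
print(f"'open access' matches: at most {open_access_share:.1%}")

# Free full text among the first 25 "stem cell" results (9 of 25).
# Assumption: treated here as if it were a random sample.
free, sample_size = 9, 25
low, high = wilson_interval(free, sample_size)
print(f"free full text: {free / sample_size:.0%} "
      f"(95% interval roughly {low:.0%} to {high:.0%})")
```

Under these assumptions the interval for 9 out of 25 runs from roughly 20% to 55%, which is wide enough to show why a larger and more systematic sample would be needed.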

The software that powers the WormBase Literature Search is Textpresso. On the About Textpresso webpage, it’s stated that: “Textpresso is an information extracting and processing package for biological literature”.

I wasn’t aware of this software until I saw an item, Presentations from Harvard publishing conference, posted by Peter Suber to Open Access News (December 29, 2007). This item led me to the PDF version of a presentation by Robert Kiley, Head of Systems Strategy, Wellcome Library. He participated in a panel session at the Harvard conference on Publishing in the New Millennium (Cambridge, November 9, 2007). His presentation was entitled: “Wellcome Trust and open access”. The heading on Slide 7/12 was: “New resources from mining the literature: Textpresso”.

Perhaps this is a useful demonstration model of the ways that are being developed to “mine the literature”? (But consider how much more useful this model would be if all of the documents in the Literature database were openly accessible.)



  1. tillje said

    With the advanced search option set to “on”, I’ve searched in WormBase for articles that include the phrase “stem cell” in the title, abstract or body of the article. Selection of articles published in 2006 yielded a total of 133 documents. I then checked the first 25 of these documents that were journal articles (omitting, for example, chapters in books). I found that the free full text was available for 13 (52%). Of these, only 3 were fully OA articles. I then checked the last 25 of the 133 documents that were journal articles, and again, the free full text was available for 13 (again, 52%). None of these latter 25 journal articles were fully OA. An example of a fully-OA document is the first one that I found: Regulation of germline stem cell proliferation downstream of nutrient sensing, by Patrick Narbonne and Richard Roy, Cell Division 2006; 1: 29.

    So, the free full text was available for about half (26 of 50) of the articles in this sample of journal articles in WormBase. All were selected to have been published in 2006, so more than a year has passed since their date of publication. Journal policy on providing free access to the full text after a delay period has played the most important role in this particular sample. This finding illustrates the importance of the duration of the “embargo period” prior to provision of free access to the full text of articles published in journals.

  2. tillje said

    WormBase is one of the databases listed in the Gene Ontology (GO) list of GO Database Abbreviations. Another example from that list is the Candida Genome Database, a “resource for genomic sequence data and gene and protein information” for Candida albicans, which provides a List of genome-wide analysis papers.

    At present, 20 of the references associated with the Candida literature topic Genome-wide Analysis are to articles published in 2007. Of these 20, the free full text is available for 7 (35%). Of these 7 freely-accessible articles, 4 have been deposited in PubMed Central and one is fully OA. This latter article is: Genome-Wide Fitness Test and Mechanism-of-Action Studies of Inhibitory Compounds in Candida albicans, by Deming Xu and 12 co-authors, PLoS Pathog 2007(29 Jun); 3(6): e92.

    I didn’t expect to obtain free access to the full text of this many articles published in 2007.

    Added Jan. 9, 2008:

    Also, 22 of the references associated with the Candida literature topic Genome-wide Analysis are to articles published in 2006. Of these 22, the free full text was available for 12 (55%). Of these 12, 11 have been deposited in PubMed Central. Only one is fully OA: Control of the C. albicans Cell Wall Damage Response by Transcriptional Regulator Cas5, by Vincent M. Bruno and 5 co-authors, PLoS Pathog 2006(17 Mar); 2(3): e21.

    So, the free full text was available for just over half of the articles published in 2006.

