Mining the biological literature about a model organism

Excerpts from Wikipedia about the model organism Caenorhabditis elegans:

Caenorhabditis elegans … is a free-living nematode (roundworm), about 1 mm in length, which lives in temperate soil environments. Research into the molecular and developmental biology of C. elegans was begun in 1974 by Sydney Brenner [1] and it has since been used extensively as a model organism.

As for most model organisms, there is a dedicated online database for the species that is actively curated by scientists working in this field. The WormBase database attempts to collate all published information on C. elegans and other related nematodes.

There’s a “Find” search box at the top of the WormBase home page. I entered the phrase “open access” in the search box and selected the “Literature Search” option in the drop-down menu. The search yielded matches in 214 of the documents in the Literature database. The “Database Description” at the bottom of each page of results stated that the “Current database contains 9475 full text papers and 24612 abstracts”.

The first 25 papers identified in the results of this search were all openly accessible via an “online text” link to the website of the journal in which the document was published. However, it appears that only a small minority of the 9475 full text papers were identified as “open access” papers. (Probably fewer than 214/9475 ≈ 2.3%, because some of the 214 “open access” documents appear to have been listed more than once.)

Another “Literature Search”, using “stem cell” as the keyword phrase, yielded matches in 867 documents. An “online text” link was present for 19 of the first 25 of these documents. This link provided free access to the full text for 9 of these 19 papers. (Unfortunately, whether or not the “online text” is freely accessible can only be determined by attempting to obtain access.)

This latter finding suggests that a much higher proportion of the documents in this database may be freely accessible than are found by searching for the keywords “open access”: 9 of the 25 documents examined (36%), versus roughly 2%. However, a much more extensive study would be needed to obtain a reliable estimate of the proportion of freely accessible documents in this particular database.
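To make the arithmetic concrete, here is a minimal Python sketch that recomputes the two proportions from the counts reported above and attaches an approximate 95% interval to the “stem cell” sample. The Wilson score interval is my own choice of method, not something reported by WormBase, and the function and variable names are purely illustrative. The interval also assumes a random sample, which the first 25 search results are not, so it should be read only as a rough indication of how little 25 documents can tell us.

```python
import math


def proportion(hits, total):
    """Return hits/total as a fraction between 0 and 1."""
    return hits / total


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Assumes the n observations are an independent random sample,
    which is only an approximation for the first page of search results.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the searches described in this post.
open_access_hits = 214      # matches for "open access"
full_text_papers = 9475     # full text papers in the database
sample_size = 25            # first 25 "stem cell" results examined
freely_accessible = 9       # of those, full text freely accessible

upper_bound = proportion(open_access_hits, full_text_papers)
sample_share = proportion(freely_accessible, sample_size)
low, high = wilson_interval(freely_accessible, sample_size)

print(f"'open access' matches / full text papers: {upper_bound:.1%}")
print(f"freely accessible in the sample of 25:    {sample_share:.1%}")
print(f"approx. 95% interval for that share:      {low:.1%} to {high:.1%}")
```

With these counts the interval is wide, roughly 20% to 55%, which is one way of seeing why a much larger (and properly random) sample would be needed before quoting a figure for the whole database.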

The software that powers the WormBase Literature Search is Textpresso. On the About Textpresso webpage, it’s stated that: “Textpresso is an information extracting and processing package for biological literature”.

I wasn’t aware of this software until I saw an item, Presentations from Harvard publishing conference, posted by Peter Suber to Open Access News (December 29, 2007). This item led me to the PDF version of a presentation by Robert Kiley, Head of Systems Strategy, Wellcome Library. He participated in a panel session at the Harvard conference on Publishing in the New Millennium (Cambridge, November 9, 2007). His presentation was entitled: “Wellcome Trust and open access”. The heading on Slide 7/12 was: “New resources from mining the literature: Textpresso”.

Is Textpresso a useful demonstration of the methods that are being developed to “mine the literature”? It appears so. (But consider how much more useful it would be if all of the documents in the Literature database were openly accessible.)
