Analyse this…

21 February 2008

EDSAC, one of the first computers, was constructed in Cambridge in the late ’40s. Soon afterwards, researchers started investigating the use of computers to analyse human language, an area known as Natural Language Processing (NLP). Cambridge-based academics have been leading research in NLP ever since.

Language is ambiguous: in the phrase “girls like dolls”, does “like” stand for “similar to” or “enjoy”? Who is holding the telescope in the sentence “John saw the man with the telescope”, “John” or “the man”? Who is “he” in the sentence “John did not meet Bill because he was ill”? To resolve these ambiguities, people use the surrounding text and their knowledge of the world. This is much harder for a computer to do.

Suppose you are studying medicine and are looking for information about H5N1 (the bird flu virus). Googling for “H5N1” returns millions of hits! How about looking at specialised databases such as PubMed? Well, just a few thousand articles to read there, just enough to keep you busy until your exam… Although NLP is not an exact science, it can be exploited to help scientists working in a specific field in their quest for relevant information. Current research at the University of Cambridge focuses on analysing the language of scientific articles in biology and chemistry.

Recognising the names that authors use to refer to genes is an important aspect of this research. This is not trivial, since many gene names overlap with common words such as “not”, “an”, “was”, “and” or “if”. The aim is to enable the computer to tell whether, for example, “and” refers to a gene, and to disambiguate expressions such as “this gene” by looking at the surrounding text. Another important task is identifying the relations between these entities, for example “rhy-1 inhibits HIF-1 activity”, which can be expressed in several other ways (e.g. “HIF-1 activity is inhibited by rhy-1”, “the inhibition of HIF-1 activity by rhy-1”, etc.). Recognising the names of chemical entities such as compounds and reactions, and the relationships between them, presents similar challenges.
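
To make the relation-finding task concrete, here is a toy sketch in Python that maps the three phrasings quoted above onto a single (agent, relation, target) triple. It is only an illustration, not the method used by the researchers: the entity pattern, the hand-written templates and the function name `extract_inhibition` are all assumptions made for this example, and a handful of fixed templates would miss most of the ways a real article can express the same fact.

```python
import re

# Assumed, simplified shape of an entity name such as "rhy-1" or "HIF-1".
ENTITY = r"[A-Za-z]+-\d+"

# Hand-written templates for the three phrasings quoted in the text.
PATTERNS = [
    # Active voice: "rhy-1 inhibits HIF-1 activity"
    re.compile(rf"(?P<agent>{ENTITY}) inhibits (?P<target>{ENTITY}) activity"),
    # Passive voice: "HIF-1 activity is inhibited by rhy-1"
    re.compile(rf"(?P<target>{ENTITY}) activity is inhibited by (?P<agent>{ENTITY})"),
    # Nominalisation: "the inhibition of HIF-1 activity by rhy-1"
    re.compile(rf"inhibition of (?P<target>{ENTITY}) activity by (?P<agent>{ENTITY})"),
]


def extract_inhibition(sentence):
    """Return (agent, "inhibits", target) if a template matches, else None."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return (match.group("agent"), "inhibits", match.group("target"))
    return None


if __name__ == "__main__":
    for text in [
        "rhy-1 inhibits HIF-1 activity",
        "HIF-1 activity is inhibited by rhy-1",
        "the inhibition of HIF-1 activity by rhy-1",
    ]:
        print(extract_inhibition(text))
```

All three sentences yield the same triple, which is the point of the exercise: once different wordings are reduced to one representation, a search tool can treat them as the same piece of information.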

This work is carried out in the Computer Laboratory in collaboration with the Department of Genetics and the Royal Society of Chemistry. By working closely with biologists and chemists, computer scientists and linguists are investigating how NLP analysis can be presented to users in order to help them extract important information from scientific publications more efficiently.

NLP is a very active research area. It has been progressing fast and has attracted significant commercial interest. So in a few years’ time, NLP techniques might be incorporated into standard search engines, making our internet searches less of a hassle.