Library Connect, Partnering with the Library Community. www.elsevier.com/libraryconnect



Features
Text Mining: It’s More Exciting Than It Sounds
Marc Krellenstein, VP, Search and Discovery Technology, Elsevier, Burlington, MA, USA

It’s rare that a name for a hot technology trend undersells its full potential, but that’s surely the case for what is known as text mining. Viewed narrowly, text mining is about automatically extracting, from unstructured text documents, all instances, or entities, of a certain type: for example, all of the drugs discussed (penicillin, tetracycline, aspirin and so on). But an extracted entity could itself be a specific kind of relationship among simpler entities, e.g., that penicillin treats pneumonia. This is more like extracting a fact than a simple object or concept, and such text mining is sometimes referred to as “fact extraction.”

One could say that a person’s comprehension of a document is also simply the extraction of all the relevant concepts and facts in it. In truth, both text mining and human reading comprehension are better viewed as a deeper effort to extract the meaning contained in a document - what the items and concepts are, what the connections are among them, what’s being said about them. In the online world text mining helps us go beyond searching to try to represent the meaning of documents - to summarize, show relationships and answer complex questions. Text mining can uncover completely unexpected relationships that would be almost impossible to find manually. The Columbia University GeneWays project, for example, has successfully extracted over a million unique interactions from relatively limited journal content, and several large pharmaceutical companies already use text mining to try to keep up with a biological literature that grows faster than the ability of researchers to read it.

Text mining in a bit more detail

As mentioned, an example of text mining might be extracting all of the instances of the type drug - penicillin, tetracycline, aspirin, etc. - from a given text corpus. The results of such an extraction could be a simple list of all the drugs found, perhaps with the number of times each drug was mentioned.
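
As a minimal sketch of the simple list described above, the following Python counts how often each drug is mentioned in a text. The vocabulary `DRUGS`, the function name `count_drug_mentions` and the sample sentence are illustrative assumptions, not part of any particular text mining product.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary; a real one would be far larger.
DRUGS = ["penicillin", "tetracycline", "aspirin"]

def count_drug_mentions(text, vocabulary=DRUGS):
    """Return a Counter mapping each vocabulary term found to its number of mentions."""
    counts = Counter()
    for term in vocabulary:
        # Whole-word, case-insensitive match of the term.
        pattern = r"\b" + re.escape(term) + r"\b"
        n = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if n:
            counts[term] = n
    return counts

sample = "Penicillin and aspirin were given. Later, penicillin was stopped."
drug_counts = count_drug_mentions(sample)
```

Calling `drug_counts.most_common()` would list the drugs found, most frequently mentioned first.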

Adding metadata increases the usefulness of information

More usefully, the extraction might result in new metadata being attached to each document in the corpus, indicating each drug found and perhaps associated information such as its location in the document. Such metadata might look like this in XML:

<drug offset="86">penicillin</drug>
<drug offset="124">tetracycline</drug>
<drug offset="213">penicillin</drug>
<drug offset="398">aspirin</drug>
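
One way such metadata could be generated is sketched below in Python, assuming a small hypothetical vocabulary and taking "offset" to mean the character position of the mention in the document. The names `DRUGS` and `annotate_drugs` are invented for illustration.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical controlled vocabulary.
DRUGS = ["penicillin", "tetracycline", "aspirin"]

def annotate_drugs(text, vocabulary=DRUGS):
    """Return one <drug> element per mention, in order of character offset."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in vocabulary) + r")\b",
        re.IGNORECASE,
    )
    return [
        "<drug offset=%s>%s</drug>"
        % (quoteattr(str(m.start())), escape(m.group(0).lower()))
        for m in pattern.finditer(text)
    ]

doc = "Aspirin was tried first; penicillin followed."
annotations = annotate_drugs(doc)
```

Each entry in `annotations` is a string in the same shape as the lines above, with the attribute value quoted so that the result is well-formed XML.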

By retaining the association between individual documents and drugs we can still produce a list of drugs but can do other things as well, such as creating a new, “synthetic” document that includes a few lines of text around each occurrence of a drug name in the corpus, effectively creating a drug summary for the corpus.
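
The "synthetic" document idea can be sketched as follows, assuming the corpus is held as a dictionary of document IDs to text and using a fixed character window rather than whole lines around each mention. The function `drug_summary`, the `width` parameter and the sample corpus are hypothetical.

```python
import re

def drug_summary(documents, drug, width=40):
    """Collect a window of text around every mention of `drug` in `documents`."""
    pattern = re.compile(r"\b" + re.escape(drug) + r"\b", re.IGNORECASE)
    snippets = []
    for doc_id, text in documents.items():
        for m in pattern.finditer(text):
            # Take up to `width` characters on either side of the mention.
            start = max(0, m.start() - width)
            end = min(len(text), m.end() + width)
            snippets.append((doc_id, text[start:end].strip()))
    return snippets

corpus = {
    "doc1": "Patients received penicillin for ten days.",
    "doc2": "No antibiotics were used.",
}
summary = drug_summary(corpus, "penicillin")
```

Joining the snippets in `summary` would give a short drug summary that still records which document each passage came from.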

Using rules and controlled vocabularies

To accomplish such mining we might start with relevant controlled vocabularies, e.g., a list of all known drugs, and simply look for occurrences of them in the text. Text mining software enhances such matching with various rules to find occurrences regardless of spacing, capitalization, misspellings, intervening words, alternative word forms, etc. In addition, pattern matching rules and natural language processing can uncover new terms not included in the vocabularies (e.g., newly named drugs) by looking at how the names are constructed and at the contexts in which they’re used.
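
A rough sketch of tolerant vocabulary matching is shown below. It handles capitalization, spacing and hyphenation by normalizing terms, and approximates misspelling tolerance with Python's standard `difflib` module. Commercial tools use far richer rules; the vocabulary, the `cutoff` value and the function names here are assumptions made for illustration.

```python
import difflib
import re

# Hypothetical controlled vocabulary.
VOCAB = ["penicillin", "tetracycline", "acetylsalicylic acid"]

def normalize(token):
    """Lower-case and collapse whitespace and hyphens so surface variants compare equal."""
    return re.sub(r"[\s\-]+", " ", token.strip().lower())

def match_term(candidate, vocabulary=VOCAB, cutoff=0.85):
    """Return the vocabulary entry that `candidate` most plausibly refers to, or None."""
    norm = normalize(candidate)
    lookup = {normalize(v): v for v in vocabulary}
    if norm in lookup:
        return lookup[norm]
    # Fall back to approximate string matching to catch likely misspellings.
    close = difflib.get_close_matches(norm, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None
```

For instance, `match_term("penicilin")` resolves to `"penicillin"`, while an unrelated word such as `"ibuprofen"` returns `None`. Discovering genuinely new drug names from context, as described above, would need more than this.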

Adding relationships makes things more interesting

Things get more interesting when multiple entities are extracted from a corpus – e.g., all the drugs and all the diseases – and particular relationships between the entities are identified and extracted, such as when a drug is a treatment for a disease or when a drug might trigger a disease as a side effect. Once again, the mining could result in marking up individual documents with the occurrences of such discovered relationships. For example:

<drug-treats-disease>
<drug offset="86">penicillin</drug>
<disease offset="124">pneumonia</disease>
</drug-treats-disease>
<drug-causes-disease>
<drug offset="213">penicillin</drug>
<disease offset="268">anaphylactic shock</disease>
</drug-causes-disease>
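
Markup of this kind could be produced from an extracted relationship with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The function `relation_element` is a hypothetical helper; the element names and offsets are taken from the example above.

```python
import xml.etree.ElementTree as ET

def relation_element(relation, drug, drug_offset, disease, disease_offset):
    """Build one relationship element wrapping a drug mention and a disease mention."""
    root = ET.Element(relation)
    drug_el = ET.SubElement(root, "drug", offset=str(drug_offset))
    drug_el.text = drug
    disease_el = ET.SubElement(root, "disease", offset=str(disease_offset))
    disease_el.text = disease
    return root

elem = relation_element("drug-treats-disease", "penicillin", 86, "pneumonia", 124)
markup = ET.tostring(elem, encoding="unicode")
```

Building the element tree rather than concatenating strings means attribute quoting and character escaping are handled by the library.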

Natural language processing aids the identification of relationships

Identifying relationships requires more extensive natural language processing to discover that two entities are not simply mentioned together but are connected by the specified relationship. This processing draws on lists of appropriate verbs or other word forms that signal the particular connection, and on more complex sentence analysis to recognize the relationship and determine that it actually holds (as opposed to, for example, its negation, which is still a relationship but a different one). This includes dealing with different tenses and voices, identifying and separating descriptive phrases and resolving indirect references (known as anaphora). An example of the last is uncovering the causal relationship between penicillin and shock in the following sentence, in which the word ‘penicillin’ is connected to shock only indirectly through a pronoun:

Penicillin treatment is not without risks. It can trigger anaphylactic shock in allergic individuals.
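
The sentence pair above can be handled by a deliberately naive sketch: a list of causal verbs, a crude negation check, and a rule that a sentence-initial "It" refers to the most recently mentioned drug. Real systems rely on proper parsing and anaphora resolution; the vocabularies, verb list and function name below are all assumptions made for illustration.

```python
import re

DRUGS = {"penicillin", "tetracycline", "aspirin"}      # hypothetical vocabulary
DISEASES = {"anaphylactic shock", "pneumonia"}         # hypothetical vocabulary
CAUSE_VERBS = r"(?:triggers?|causes?|induces?)"        # hypothetical trigger verbs

def extract_causes(text):
    """Return (drug, disease) pairs linked by a causal verb, resolving a leading 'It'."""
    relations = []
    last_drug = None  # most recently mentioned drug, used as the pronoun's antecedent
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        drugs_here = [d for d in sorted(DRUGS)
                      if re.search(r"\b" + re.escape(d) + r"\b", lowered)]
        subject = drugs_here[0] if drugs_here else None
        if subject is None and re.match(r"it\b", lowered):
            subject = last_drug  # naive anaphora rule: "It" = last drug mentioned
        verb = re.search(r"\b" + CAUSE_VERBS + r"\b", lowered)
        negated = re.search(r"\b(?:not|never|no)\b", lowered) is not None
        if subject and verb and not negated:
            tail = lowered[verb.end():]  # only look for diseases after the verb
            for disease in sorted(DISEASES):
                if disease in tail:
                    relations.append((subject, disease))
        if drugs_here:
            last_drug = drugs_here[-1]
    return relations

text = ("Penicillin treatment is not without risks. "
        "It can trigger anaphylactic shock in allergic individuals.")
pairs = extract_causes(text)
```

Here the first sentence only establishes penicillin as the antecedent; the relationship itself is found in the second sentence once the pronoun is resolved.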

Innovative new tools allow users to analyze and visualize data

Opposite is a diagram produced by an analytic tool from ClearForest that takes mined, extracted data as its input and allows a user to analyze and visualize that data in various ways. This particular diagram indicates whether a significant relationship of any kind is present between all the various genes (and related biological entities) and diseases mentioned in some 25,000 Elsevier journal abstracts mined as part of a pilot project. It also includes two sub-diagrams that are displayed when a particular entity - the gene p53 in this case - is expanded and viewed to see, first, all the genes and diseases related to it, and, second, all the organisms in which it is studied. The documentary evidence is displayed when one clicks on the bar indicating a relationship between two entities (as shown in the screenshot below).


Commercial text mining products from companies such as Inxight (www.inxight.com) or ClearForest (www.clearforest.com) often come with standard rule sets (or “recognizers”) for identifying entities such as people and companies, as well as basic relationships between them - a company buying/divesting/suing another company, or a person holding a certain position within a company. And while current text mining efforts are certainly not perfect, perfection is not necessarily needed (or achieved by people either). Occasional, often obvious mistakes can sometimes be ignored, and failing to find something in one case may be compensated for by finding it mentioned somewhere else.

Some more complex applications of text mining

More complex text mining moves beyond relatively simple entities and relationships to complex or composite entities such as events, or evaluative descriptions. The US government makes use of text mining to look for connections between people and events as part of its investigations into terrorist activities, where an event entity could itself be a complex relationship among simpler entities such as location, date, person(s) and associated event type (e.g., an explosion). IBM’s WebFountain, a service for mining the Internet, is being used to identify “reputations” of companies by uncovering positive or negative descriptions of them.
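
A composite entity of the kind described above can be pictured as a record built from simpler entities. The sketch below models an event with a type, location, date and persons, and then links people who appear in the same event. The class, the function and all sample values are placeholders invented for illustration, not data from any real system.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Event:
    """A composite entity made up of simpler extracted entities."""
    event_type: str
    location: str
    date: str
    persons: tuple = ()

def people_linked_by_events(events):
    """Map each pair of persons to the events in which both are mentioned."""
    links = defaultdict(list)
    for ev in events:
        for a, b in combinations(sorted(ev.persons), 2):
            links[(a, b)].append(ev)
    return dict(links)

# Placeholder events; every value here is made up.
events = [
    Event("meeting", "City A", "date 1", ("Person X", "Person Y")),
    Event("explosion", "City B", "date 2", ("Person Y", "Person Z")),
]
links = people_linked_by_events(events)
```

Treating the event as a single record is what lets later analysis ask questions across documents, such as which people are connected through shared events.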

Modelling human comprehension

Modelling human comprehension has long been a goal of research in artificial intelligence. The success of text mining work and the natural language processing tools developed for it seems to stem from the fact that it proceeds from practical efforts to identify specific information. Each specific recognizer is necessarily limited, but basic building blocks - for example, recognizing drugs, people or events - can be reused. There is no particular limit to what the paradigm may be applied to - experimental methods? refuted theories? government policies? - as long as the specific linguistic tools needed are there or can be developed.

Humans don’t seem to require as much specialized preparation for understanding something but perhaps that is because we have already built up such a large repertoire of things we do understand. Certainly there are times when we will not understand something unless we first educate ourselves, building up from the basics.

What does the future hold?

Text mining represents one of the most practical and productive uses of natural language processing methods (and other AI techniques) today, and the most successful paradigm to date for simulating human comprehension. Its continued development is likely to stimulate additional focused, useful research in “natural language” and other technologies that support it. Most importantly though, it puts us on a path where we can proceed to automatically extract and accumulate the sort of knowledge that until now could only be gathered by hand (or, more accurately, by eye and brain). This will allow publishers and libraries to offer services that dramatically increase the value of the content they offer.

As information continues to expand at rates greater than the processing ability of any one person, or perhaps of all persons, text mining may prove indispensable to support continued rapid advances in our understanding of the world.

Password-Guessing: A Threat to Security?

Back in March, Elsevier’s EDIT (Elsevier Dayton Information Technology) team noticed a number of attempts at password-guessing involving ScienceDirect accounts. Further investigation revealed that in most cases password-guessing occurred where generic passwords such as library, science, sciencedirect, or password, were being used.

As a short-term measure, all ScienceDirect user IDs with compromised passwords had their passwords reset and customers were informed by Elsevier eCustomer Service.

To ensure future passwords are more secure, a new ScienceDirect system will be implemented making it compulsory for passwords to be alphanumeric and contain more than six digits.

In the meantime please try to avoid generic and obvious passwords like: password, science, sciencedirect, your own username or library. Thank you.
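
For readers who want to check a candidate password against this advice, here is a minimal sketch. It assumes that "alphanumeric" means only letters and digits with at least one of each, and that "more than six" refers to total length in characters. It is not the actual ScienceDirect rule set, and the function name is hypothetical.

```python
# Generic passwords named in this notice.
GENERIC_PASSWORDS = {"password", "science", "sciencedirect", "library"}

def is_acceptable(password, username=""):
    """Return True if `password` passes the checks assumed above."""
    lowered = password.lower()
    if lowered in GENERIC_PASSWORDS:
        return False
    if username and lowered == username.lower():
        return False
    if len(password) <= 6:  # assumed reading of "more than six"
        return False
    has_letter = any(c.isalpha() for c in password)
    has_digit = any(c.isdigit() for c in password)
    return password.isalnum() and has_letter and has_digit
```

Under these assumptions, `is_acceptable("library")` is rejected while a longer mix of letters and digits is accepted.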
