Friday, February 08, 2008

Text Analysis with OpenCalais in TopBraid

OpenCalais is an amazing Web Service that was recently made publicly available by Reuters. In a nutshell, OpenCalais takes arbitrary text or HTML documents as input and tries to extract semantic web entities from them. For example, it can identify persons, companies and countries, and it returns them as machine-readable RDF data structures. Needless to say, this extraction does not work perfectly, because understanding human language requires (artificial) intelligence and a lot of implicit background knowledge. Even so, it can produce astonishing results. We ran it over the TopQuadrant management web site and it correctly identified all five people, their respective roles, and some of their former companies.

Such text-entity-extraction services have in the past been (very expensive) niche products. The recent announcement by Reuters (which acquired the text mining company ClearForest last year) that it would make OpenCalais available for free came as a great surprise. Many of our customers have asked for features to import text into their ontologies, so integrating OpenCalais into TopBraid was an obvious next step for us. TopBraid Composer 2.5 now includes several features that seamlessly integrate the Calais web service into data processing tasks. For example, you can extract RDF from arbitrary HTML files on the web and save the results into files. Or you can put .txt files into your workspace and directly import them into some other RDF/OWL project; Calais will be called automatically.

However, the real power of OpenCalais shows when it is used in data processing pipelines such as SPARQLMotion scripts. The following TopBraid screenshot shows a SPARQLMotion script that

  • loads the latest business news from a New York Times RSS feed
  • sends the text of the news items to OpenCalais (OpenCalais will identify all countries mentioned in the news)
  • iterates over all countries to request their geo coordinates from the geonames web service
  • displays all countries on a Google Map
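
Outside of SPARQLMotion, the same four steps can be sketched in a few lines of Python. This is only an illustration of the data flow, not the SPARQLMotion script itself. The names news_texts and country_markers are made up for this sketch, and the two callables extract_countries and lookup_coordinates are placeholders for the calls to OpenCalais and geonames, whose real request formats are not shown here.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def news_texts(rss_xml: str) -> List[str]:
    """Step 1: return 'title. description' for every <item> of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    texts = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        texts.append(". ".join(part for part in (title, description) if part))
    return texts


def country_markers(
    rss_xml: str,
    extract_countries: Callable[[str], Iterable[str]],
    lookup_coordinates: Callable[[str], Optional[Coordinates]],
) -> Dict[str, Coordinates]:
    """Steps 2 to 4: collect one map marker per country mentioned in the feed.

    extract_countries stands in for the entity extraction service and
    lookup_coordinates for the geo lookup service (both are assumptions).
    """
    markers: Dict[str, Coordinates] = {}
    seen = set()  # ask the geo service only once per country
    for text in news_texts(rss_xml):
        for country in extract_countries(text):
            if country in seen:
                continue
            seen.add(country)
            coordinates = lookup_coordinates(country)
            if coordinates is not None:
                markers[country] = coordinates
    return markers


if __name__ == "__main__":
    # Stand-in data and services, for demonstration only.
    sample_feed = (
        '<rss version="2.0"><channel>'
        "<item><title>Oil prices rise</title>"
        "<description>Norway reports gains</description></item>"
        "</channel></rss>"
    )
    fake_coordinates = {"Norway": (62.0, 10.0)}
    print(country_markers(
        sample_feed,
        lambda text: [name for name in fake_coordinates if name in text],
        fake_coordinates.get,
    ))
```

The resulting dictionary of country names and coordinates is what a map display would consume. Passing the two services in as functions keeps the sketch independent of any particular web service client.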



This SPARQLMotion script is of course just one way of using the information delivered by Calais. A more comprehensive solution would probably include a countries ontology that already has background information (including coordinates, capitals, financial details) about each country. Then SPARQLMotion could be used to create an intelligent agent that analyzes newsfeeds (or any other textual data source) against semantic query patterns such as "Alert me if there is any news about a company merger in an oil-exporting country". If you want to play with all this, please download TopBraid Composer Maestro 2.5.0, but keep in mind that SPARQLMotion is a work in progress and not yet complete (Matt Fischer recently wrote an independent review of an even older version of SPARQLMotion that illustrates some of the open issues).
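
To make the alert pattern mentioned above a bit more concrete, here is a minimal Python sketch of the matching logic. It assumes that entity extraction has already produced, for each news item, a set of event types and a set of countries, and that a background ontology can be reduced to a set of facts per country. The class NewsItem, the function matching_alerts, and the labels "Merger" and "oil-exporting" are all invented for this illustration; they are not Calais or TopBraid names. In TopBraid this kind of pattern would presumably be expressed as a query rather than as Python.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class NewsItem:
    """A news item after entity extraction (hypothetical structure)."""
    headline: str
    event_types: FrozenSet[str]  # e.g. {"Merger"}
    countries: FrozenSet[str]    # countries mentioned in the item


def matching_alerts(
    items: Iterable[NewsItem],
    country_facts: Dict[str, Set[str]],
    event_type: str,
    required_fact: str,
) -> List[str]:
    """Return headlines with the given event type that mention at least
    one country carrying the required background fact."""
    alerts = []
    for item in items:
        if event_type not in item.event_types:
            continue
        if any(required_fact in country_facts.get(country, set())
               for country in item.countries):
            alerts.append(item.headline)
    return alerts


if __name__ == "__main__":
    # Made-up background knowledge and news items, for demonstration only.
    facts = {"Norway": {"oil-exporting"}, "Switzerland": set()}
    news = [
        NewsItem("A merges with B", frozenset({"Merger"}), frozenset({"Norway"})),
        NewsItem("C merges with D", frozenset({"Merger"}), frozenset({"Switzerland"})),
        NewsItem("Oil output up", frozenset(), frozenset({"Norway"})),
    ]
    print(matching_alerts(news, facts, "Merger", "oil-exporting"))
```

The point of the sketch is that the alert combines two sources: what the text analysis found in the news item, and what the ontology already knows about the countries involved.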

Note that OpenCalais seems to be part of a larger roadmap at Reuters aimed at making "all the world's content more accessible and valuable". It is great to see a world-leading information company embrace the Semantic Web vision so directly! As a comprehensive information integration and ontology design tool, TopBraid Composer, together with its SPARQLMotion language, seems to be an ideal platform to process, analyze and visualize the information that OpenCalais delivers.
