CFL Software, PSC Collaborate on Next Generation of Information Searching

New software being developed by CFL Software may transform our ability to search for information in text documents as profoundly as search engines improved upon paper library card catalogs. The software, CFL Discover, will search electronic text documents far more completely and accurately than is possible with today’s search technologies.

Pittsburgh Supercomputing Center (PSC) is collaborating with CFL as a strategic partner in developing CFL Discover, making the software available to researchers on Sherlock, a modified version of YarcData’s Urika, a real-time data discovery appliance at the center.

“This is a new venture both in terms of scale and speed in searching for information,” says David Woolls, CEO of CFL Software, which specializes in linguistic document forensics. “In essence, we take over where search engines stop.”

While many users may not be aware of it, search engines don’t search the complete text of every document on the Web, because that would take far too long. Instead, they search indexes, keywords, categories and other “metadata” that have been added to Web documents. In the case of keywords and categories, that addition has to be made by humans, and so is time-intensive and incomplete. Today’s engines obviously revolutionized our ability to find information, but they are inexact. Many irrelevant sites pop up, and many sites that may be more suitable aren’t captured. In a sense, we all stop when we reach a site that is “good enough” rather than one that’s best for our needs.

“Search engines start with a few words and return a list of documents which contain them,” Woolls adds. “CFL Discover starts with one or more of those documents and reads them for you, shows you the terminology that is shared and gives immediate access to the passages of particular interest to you.”

The program uses the industry-standard SPARQL query language and RDF (Resource Description Framework), both supported by YarcData’s appliance, to search entire texts for meaningful connections between the words in a search query and related language in other texts. This kind of “graph search” enables someone searching for information to find relevant connections that they may not have thought of. The program is written in Java, so it is platform independent and can work on anything from a standard PC to a Java-capable supercomputer. (While most supercomputers can’t run Java, two at PSC, Sherlock and Blacklight, do, providing valuable support for research communities that primarily use Java for data analytics.) The choice of platform and computer depends solely on the volume of data and the speed of response required.
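To make the idea of a graph search concrete, here is a minimal sketch in Python (chosen for brevity; CFL Discover itself is written in Java). It is not CFL's implementation: the sample triples, the `containsTerm` predicate and the `related_documents` function are all hypothetical, and a small in-memory set stands in for a real RDF store queried with SPARQL. It only illustrates the general pattern of starting from one document and following shared terms to other documents.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples in the spirit of RDF.
# None of these identifiers come from CFL Discover.
TRIPLES = {
    ("doc:A", "containsTerm", "graph"),
    ("doc:A", "containsTerm", "query"),
    ("doc:B", "containsTerm", "graph"),
    ("doc:B", "containsTerm", "patent"),
    ("doc:C", "containsTerm", "patent"),
    ("doc:C", "containsTerm", "query"),
    ("doc:D", "containsTerm", "needle"),
}


def related_documents(triples, start_doc):
    """Return {other_doc: sorted shared terms} for documents sharing a term with start_doc.

    This mirrors a two-hop graph pattern of the form
        start_doc containsTerm ?t .  ?other containsTerm ?t .
    """
    docs_by_term = defaultdict(set)
    terms_by_doc = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "containsTerm":
            docs_by_term[obj].add(subject)
            terms_by_doc[subject].add(obj)

    # Follow each term of the starting document to every other document using it.
    shared = defaultdict(set)
    for term in terms_by_doc[start_doc]:
        for other in docs_by_term[term]:
            if other != start_doc:
                shared[other].add(term)
    return {doc: sorted(terms) for doc, terms in shared.items()}


if __name__ == "__main__":
    for doc, terms in sorted(related_documents(TRIPLES, "doc:A").items()):
        print(doc, terms)
```

With this sample data, starting from `doc:A` returns `doc:B` (through the shared term "graph") and `doc:C` (through "query"), while `doc:D`, which shares no terms, is not returned.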

“It’s less like searching for a needle in a haystack than searching for a needle in a needlestack,” says Arvind Parthasarathi, President, YarcData. The advantage of CFL Discover is that it allows related groups of documents to be rapidly identified, not on the basis of pre-determined keywords and categories, but purely on the similarity of the content. This in turn allows the rapid creation of new combined databases from a collection of existing databases. For example, when searching Wikipedia, entering the title of an article causes CFL Discover to read the database, returning a comprehensive list of potentially interesting articles related to the whole content. And because the framework is RDF, searches of other RDF collections can be readily performed. The principles on which the program works allow it to be used in many different languages, including Arabic, Chinese, Thai and Finnish, which appear to be very disparate to the human eye.

“The structures and sequences inherent to individual documents are all that are needed to encode them,” Parthasarathi says. “New material is easily added to existing stores and is immediately available for use by the search queries.”

CFL Software has carried out proof-of-concept studies in which CFL Discover searched the description sections of U.S. Patent Office records and legal documents, as well as Wikipedia. The collaboration with PSC will employ the program on PSC’s Sherlock, which is optimized to search extremely large and complex bodies of information with open-ended queries. The new work will explore a substantial portion of the U.S. Patent database, in addition to the full data of Wikipedia in more depth.

“PSC’s role in the partnership is to couple the unique analytic capability of Sherlock running CFL Discover with hosting massive datasets on PSC’s Data Supercell to expand text analytics to unprecedented, interdisciplinary use cases,” says Nick Nystrom, PSC’s director of strategic applications. “Response time is critical for exploring big data, and Sherlock with CFL Discover will provide rapid analyses of unstructured text data larger than can be done on any platform currently available to U.S. researchers.”

“We see high value for a wide range of research and societal applications,” Nystrom adds. Examples include analyzing recent events from news and social media sources, extracting deeper insights from sets of publications, and enabling computational history and culturomics — the quantitative study of cultural phenomena by analyzing large volumes of written records. “Application of high-performance analytics is new to these and similar fields, and will catalyze new ways of leveraging unstructured text data.”