Penetrating the Deep Web

Stephen DeAngelis

March 10, 2009

I have written a number of posts about search engines and the World Wide Web. New search engines are routinely rolled out with the hope that they will challenge and then dislodge Google from the top of the heap. For its part, Google continually tries to improve its own technology so that it remains on top and competitive. The Web has developed from a loose collection of discoverable sites (Web 1.0) into a more complex, interconnected array of social sites (Web 2.0), and the next objective is to make it even smarter by increasing its understanding of relationships in what is known as the Semantic Web or Web 3.0. Alex Wright reports that efforts are underway to penetrate what is known as the Deep Web, the vast collection of databases that must be connected in order to achieve the objectives of a Semantic Web [“Exploring a ‘Deep Web’ That Google Can’t Grasp,” New York Times, 22 February 2009]. Wright reports:

“One day last summer, Google’s search engine trundled quietly past a milestone. It added the one trillionth address to the list of Web pages it knows about. But as impossibly big as that number may seem, it represents only a fraction of the entire Web. Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines. The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like ‘What’s the best fare from New York to London next Thursday?’ The answers are readily available — if only the search engines knew how to find them. Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results — it may ultimately reshape the way many companies do business online.”

For casual users of the Web, the most commonly heard term is “surfing” the Web. For those intimately familiar with the Web, however, a more accurate term is “crawling.” Wright informs his readers that search engines like Google rely on programs “known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together.” When you think about it, the imagery of a spider on the Web makes a lot more sense than that of a surfer. The problem, Wright says, is that crawlers work well for finding pages on the “surface Web,” but “have a harder time penetrating databases that are set up to respond to typed queries. ‘The crawlable Web is the tip of the iceberg,’ says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources.”

In other words, the goal of a Deep Web search is to provide understanding, not just results. The challenge is that even the most powerful search engines can’t possibly “sift through every possible combination of data on the fly. To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases.” This is where semantics comes into play.

“For example, if a user types in ‘Rembrandt,’ the search engine needs to know which databases are most likely to contain information about art (say, museum catalogs or auction houses), and what kinds of queries those databases will accept. That approach may sound straightforward in theory, but in practice the vast variety of database structures and possible search terms poses a thorny computational challenge.”
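
To make the brokering idea concrete, here is a minimal sketch in Python. It assumes each database has already been summarized as a small set of terms it is known to cover. The profile names, the terms, and the `route_query` function are all hypothetical illustrations, not anything described in Wright's article or used by Kosmix or Google.

```python
# Hypothetical database profiles: each source is summarized by a few terms
# it is known to cover. Real systems would learn these, not hard-code them.
DATABASE_PROFILES = {
    "museum_catalog": {"rembrandt", "vermeer", "picasso", "painting", "portrait"},
    "auction_house": {"rembrandt", "picasso", "auction", "lot", "estimate"},
    "flight_schedules": {"fare", "flight", "london", "york"},
}


def route_query(query, profiles, top_n=2):
    """Return up to top_n database names, ranked by query-term overlap."""
    terms = set(query.lower().split())
    # Score each database by how many query terms appear in its profile.
    scores = {name: len(terms & vocab) for name, vocab in profiles.items()}
    # Keep only databases with at least one match; break ties by name.
    matches = [name for name, score in scores.items() if score > 0]
    matches.sort(key=lambda name: (-scores[name], name))
    return matches[:top_n]


print(route_query("Rembrandt portrait", DATABASE_PROFILES))
print(route_query("best fare London", DATABASE_PROFILES))
```

Even this toy version hints at the difficulty Wright describes: the routing is only as good as the profiles, and a real broker would also have to learn what kinds of queries each database accepts.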

Of course, Rembrandt could also refer to a brand of toothpaste. Semantics is a challenge with which my company, Enterra Solutions, is intimately familiar. Part of our business is creating rules and rule sets that can automate business processes. The hardest part of doing that is creating an ontology that establishes a framework for relationships. Take the word “tank.” It could mean a mobile military cannon, a container for holding liquids, or the current state of the U.S. economy. In order to provide the proper relationship, a search engine needs to understand all of these meanings and then discern which of them the inquirer is most likely to want to know about.
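
The “tank” problem can be sketched with a simple overlap heuristic, loosely in the spirit of the Lesk approach to word-sense disambiguation: pick the sense whose associated words share the most terms with the surrounding context. The sense labels, signature words, and `disambiguate` function below are hypothetical, chosen only for illustration. This is not a description of how Enterra Solutions or any search engine actually builds an ontology.

```python
# Hypothetical sense inventory for the word "tank": each sense is described
# by a handful of words that tend to appear near it.
TANK_SENSES = {
    "military vehicle": {"army", "armor", "cannon", "battle", "military"},
    "liquid container": {"water", "fuel", "gallon", "storage", "liquid"},
    "decline sharply": {"economy", "market", "stocks", "prices", "sales"},
}


def disambiguate(context, senses):
    """Return the sense sharing the most words with the context, or None."""
    words = set(context.lower().split())  # naive tokenizer, no punctuation handling
    best_sense, best_overlap = None, 0
    for sense, signature in senses.items():
        overlap = len(words & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("fuel tank storage capacity", TANK_SENSES))
print(disambiguate("will the economy tank this year", TANK_SENSES))
```

With no helpful context at all, the function returns `None`, which is the honest answer: a bare query of “tank” gives a machine nothing to reason with.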

“‘This is the most interesting data integration problem imaginable,’ says Alon Halevy, a former computer science professor at the University of Washington who is now leading a team at Google that is trying to solve the Deep Web conundrum. Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — ‘Rembrandt,’ ‘Picasso,’ ‘Vermeer’ and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.”

Lots of Web users are probably wondering what’s wrong with things the way they are. Futurists, however, are never satisfied with the status quo. Those hoping to penetrate the Deep Web recognize how much untapped information is available and openly wonder what could be accomplished if that information could be tapped and connected. As one might imagine, interest is particularly keen at institutions of higher education.

“Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game. ‘The naïve way would be to query all the words in the dictionary,’ Ms. Freire said. Instead, DeepPeep starts by posing a small number of sample queries, ‘so we can then use that to build up our understanding of the databases and choose which words to search.’ Based on that analysis, the program then fires off automated search terms in an effort to dislodge as much data as possible. Ms. Freire claims that her approach retrieves better than 90 percent of the content stored in any given database. Ms. Freire’s work has recently attracted overtures from one of the major search engine companies.”
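
The guessing game Ms. Freire describes can be illustrated with a toy simulation. The sketch below assumes a tiny in-memory list of records standing in for a database behind a search form. It starts from a seed term, harvests words from whatever comes back, and uses the most frequent unseen word as the next query. The records, the `search` and `probe` functions, and the selection rule are all assumptions made for illustration; this is not DeepPeep's actual algorithm.

```python
from collections import Counter

# Toy stand-in for the records hidden behind a search form (hypothetical data).
RECORDS = [
    "rembrandt self portrait oil",
    "rembrandt night watch oil",
    "vermeer girl pearl earring oil",
    "picasso guernica mural",
    "monet water lilies oil",
]


def search(term):
    """Simulate a form query: indices of records containing the exact word."""
    return {i for i, text in enumerate(RECORDS) if term in text.split()}


def probe(seed_terms, max_queries=10):
    """Issue queries adaptively; return (retrieved record indices, issued terms)."""
    retrieved, issued = set(), set()
    candidates = Counter()  # words seen in results, counted as future query ideas
    queue = list(seed_terms)
    while queue and len(issued) < max_queries:
        term = queue.pop(0)
        if term in issued:
            continue
        issued.add(term)
        new_hits = search(term) - retrieved
        retrieved |= new_hits
        # Harvest vocabulary from newly discovered records only.
        for i in sorted(new_hits):
            candidates.update(w for w in RECORDS[i].split() if w not in issued)
        # When the queue runs dry, pick the most frequent word not yet tried.
        if not queue:
            for word, _count in candidates.most_common():
                if word not in issued:
                    queue.append(word)
                    break
    return retrieved, issued


found, asked = probe(["rembrandt"], max_queries=2)
print(f"retrieved {len(found)} of {len(RECORDS)} records with {len(asked)} queries")
```

In this toy data, two queries surface four of the five records, because the word “oil” harvested from the first results unlocks most of the rest. The Picasso record shares no words with the others, so no amount of harvesting reaches it from this seed, which suggests why the choice of sample queries matters so much.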

The challenge becomes apparent if you go to the DeepPeep site and attempt to conduct a simple search. For example, I entered the word “development” and received back 313 sites relevant to that term. They covered everything from biological development to job development to property development. I didn’t immediately find anything relevant to the development of nation-states. Wright notes that companies like Google face another challenge as well. Part of Google’s “brand” is the way it presents search results. By dramatically changing the way it presents search results, Google risks losing some of its users. Penetrating the Deep Web, however, will require significantly different ways of viewing and sorting through search results.

“Beyond the realm of consumer searches, Deep Web technologies may eventually let businesses use data in new ways. For example, a health site could cross-reference data from pharmaceutical companies with the latest findings from medical researchers, or a local news site could extend its coverage by letting users tap into public records stored in government databases. This level of data integration could eventually point the way toward something like the Semantic Web, the much-promoted — but so far unrealized — vision of a Web of interconnected data. Deep Web technologies hold the promise of achieving similar benefits at a much lower cost, by automating the process of analyzing database structures and cross-referencing the results.”

As Wright implies, don’t hold your breath waiting for the Semantic Web. Semantics is a difficult nut to crack. For most users of the Web, penetrating the Deep Web is not that important. For businesses, however, penetrating the Deep Web could provide many benefits. Wright cites Mike Bergman, “a computer scientist and consultant who is credited with coining the term Deep Web,” who “said the long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers.” The long-term benefits of the technologies now being developed include greater efficiency and more effective processes. My company understands how difficult it is to build a system that relies on semantic rules, but it also understands the enormous benefits of such a system. Rules and rule sets do not come in a one-size-fits-all solution for businesses. Each solution must be tailored to a vertical or micro-vertical sector and then further tailored to a particular business’s operating environment. Such systems can make us more secure as the power of databases helps track resources and goods around the globe. The more effectively and efficiently that is done, the smaller the “security tax” on the global economy becomes. Through the correct use of automated rule sets, privacy concerns can be minimized and system efficiency and effectiveness increased.