
Smart Search

"Surfing the Web.” Like a lot of information-age talk, the phrase is both schizophrenic and appropriate. For if this World Wide Web we grow to depend on is all-connected and all-connecting, an enfolding organism spinning out and out and out, it is also oceanic. Planet-girdling. Practically fathomless. And when it comes to using such a vast, dynamic resource, most of us are like the skinny kid in the half-zipped wetsuit, paddling just beyond the first break. We can stay up—mostly. On a good day, we can handle a three-foot curl. But we never stray too far from shore. And we can easily wind up a long way down the beach, with no idea how we got there.

Image: blue, yellow, and gray molecules (credit: Steve Coast)

The images illustrating this article were drawn from a computer animation created by Steve Coast, a computer artist and student of physics at University College London. The animation was generated by mapping the connections between 10,000 computers on the Internet. The first two images and the cover of this magazine were originally rendered in three dimensions. Coast explains: “Each sphere represents a computer, and the white cylinders the connections between them. Each cylinder acts as a spring with a natural rest length equal to the time separation between computers. Each sphere is electrostatically repulsive, pushing all the other spheres away. The result is that the springs and repulsive charges act against each other and produce a layout.”

Lee Giles is not much interested in surfing. “Mining” and “extraction” are terms more to his liking. Giles, the David Reese professor of information sciences and technology at Penn State, has devoted his career to finding better ways to get at information, to wring the most out of it, to marshal it efficiently.

His background is in artificial intelligence, a field for which the processing of oceans of information is practically its raison d'être. In government labs, working for the Air Force and the Navy, and then at NEC Laboratories, a Princeton, New Jersey, think tank sponsored by the Japanese computer giant NEC, Giles has focused particularly on the area of machine learning, trying to teach computers to think for themselves.

Training them, first of all, to recognize patterns in large bodies of information. Giles's work traces back to some pretty fundamental stuff: theories about the way humans think.

“I'm not really a theorist,” he asserts. “I'm more of an experimentalist. I like to try things out, see how far we can go, what works and what doesn't.” To a guy like that, the Web, once it began to explode in the mid-1990s, was simply irresistible. If it was a testing ground that Giles wanted, here was one beyond imagining. Not a musty, lab-worn data set, but a great amorphous sea of real-world information. If ever there was a place where machine learning might be put to good use, he thought, this was it.

“The Web exists as a distributed sort of information base,” Giles says, with typical understatement. Unregulated, decentralized, the work of tens of millions of disparate authors, and constantly growing at an ever-accelerating rate, the Web is no easy object to take the measure of. Yet characterizing the Web, understanding its parameters and its behavior, was the first thing Giles set about doing. “What's there, how it is connected, how it changes, who uses it, why they use it—the more you know about these things, the more efficiently you're able to use it,” he says.

Giles's forays into what he calls “the basic mathematics of the Web” have overturned some rather basic assumptions. In a recent study, for example, he and colleagues at NEC Laboratories, where he remains a consulting scientist, showed that despite its bottom-up, heterogeneous nature, the Web organizes itself rather neatly into “communities” of closely related pages, which are easily identified by analyzing the structures of links that grow up between sites.

In another study, published last year in the Proceedings of the National Academy of Sciences, he and his co-authors challenge the widely held notion that the competition for attention on the Web is purely “winner-take-all,” i.e., that new sites on the Web are more likely to attach themselves to sites that already have many links, ensuring that a small number of established sites will always receive a disproportionate share of Web traffic. While this preferential behavior does accurately describe the Web as a whole, Giles and his co-authors write, it varies significantly by the type of site considered. Thus, while a new newspaper or entertainment site might find it difficult to compete with similar sites that are already popular, university sites and the pages of individual scientists exhibit a more egalitarian link growth. “The behavior is more complicated than had been thought,” Giles says.

Five years ago, he and colleague Steve Lawrence attracted a lot of media attention when they published the results of what was the first serious attempt to estimate the Web's size.

“No one had modeled it before,” Giles says. To do so, he and Lawrence used a variation of capture-recapture analysis, a statistical technique pioneered by population biologists.

“It's an old idea, first published in the 1890s,” Giles explains. “If I want to estimate how many frogs are in a pond, I capture a number of them, label them, throw them back, wait for them to mix back in, then do another capture and see how many frogs are re-caught. The number initially captured times the number captured the second time, divided by the intersection—the number captured both times—equals your estimate.”

To translate that to the Web, Giles and Lawrence looked at the overlap of pages “captured” by six leading search engines—HotBot, AltaVista, Northern Light, Excite, Infoseek, and Lycos—in response to 575 randomly selected queries by employees of NEC Laboratories. (Most NEC employees are scientists, the authors reasoned, “and scientists tend to search for less ‘popular,’ or harder-to-find, information”; this would give them a more accurate estimate.) Not content to simply compare the number of documents returned by each engine, Giles and Lawrence downloaded and analyzed every single page that each engine listed—about 150,000 pages in all.
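For readers who want to see the arithmetic, here is a minimal sketch of the capture-recapture estimate in Python. The two engines and their counts are invented for illustration; they are not the figures or the full methodology of the Lawrence and Giles study.

```python
# A minimal sketch of the capture-recapture arithmetic, with invented numbers.
# Each engine's "catch" is the set of page URLs it returned for the same queries.

def capture_recapture_estimate(first_catch, second_catch):
    """Estimate total population size from two overlapping samples."""
    overlap = len(first_catch & second_catch)   # pages "caught" both times
    if overlap == 0:
        raise ValueError("No overlap: the estimate is undefined.")
    return len(first_catch) * len(second_catch) / overlap

engine_a = {"url%d" % i for i in range(0, 600)}      # 600 pages found by engine A
engine_b = {"url%d" % i for i in range(400, 1400)}   # 1,000 pages found by engine B
                                                     # 200 pages found by both

print(capture_recapture_estimate(engine_a, engine_b))  # 600 * 1000 / 200 = 3000.0
```

The same ratio drives the frog example: the first catch times the second catch, divided by the number caught both times.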

When all the frogs were counted, the Web turned out to be a lot bigger than anybody thought, containing a minimum of 320 million pages as of April 1998, when the study was published in the journal Science. (By February 1999, that minimum figure had leaped to 800 million, and as of March 2003, it was several billion. By comparison, the world's largest library, the Library of Congress, holds 120 million items.)

Image: gray sticks and blue and white balls (credit: Steve Coast)

Perhaps more important than what it showed about the sheer size of the Web, Giles says, was what this early study implied about the accessibility of the information published there.

“Before our study people were saying ‘If it's not indexed, it's not there.’ Search engines were claiming they indexed everything. We showed that the best search engines indexed only a third of the total number of documents on the Web. Some of them did much less.”

Combining the results of multiple engines significantly improved search results, Giles and Lawrence found. But even all six engines combined found only two-thirds of what was out there. The rest was simply beyond reach.

Today's search engines come in two types. Directory-based engines, like Yahoo, are still built manually. “What that means,” Giles says, “is that you decide what your directory categories are going to be”—Business, and Health, and Entertainment—“and then you put a person in charge of each category, and that person builds up an index of relevant links. This used to be the function of librarians, right?” Crawler-based engines, like Google, employ a software program—called a crawler—“that goes out and follows links, grabs the relevant information, and brings it back to build your index,” Giles explains. “Then you have an index engine that allows you to retrieve the information in some order, and an interface that allows you to see it. It's all done automatically.”
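To make that crawl-and-index loop concrete, here is a bare-bones sketch in Python using only the standard library. The seed URL and page limit are placeholders, and a production crawler like Google's adds politeness rules (robots.txt, rate limits), deduplication, ranking, and far sturdier parsing.

```python
# A toy crawler: follow links, grab the text, build an inverted index.
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkAndTextParser(HTMLParser):
    """Collect outgoing links and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self.words = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
    def handle_data(self, data):
        self.words.extend(data.lower().split())

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from a seed page, building a word -> URLs index."""
    index = defaultdict(set)
    frontier, seen, fetched = deque([seed_url]), {seed_url}, 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                            # unreachable page: skip it
        fetched += 1
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            index[word].add(url)                # the "build your index" part
        for link in parser.links:               # the "follows links" part
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index

# index = crawl("http://example.com")
# index["information"] would then list the crawled pages containing that word.
```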

Directory-based engines offer the inimitable benefit of human expertise in a given subject area. Alas, Giles notes, “Humans don't scale very well.” As a result, as the pool of online information grows ever deeper, crawler-based engines become more and more popular.

But automatic engines have their limitations, too. For one thing, most current crawlers are unable to recognize “spam,” which in this context means unreliable information. In the unregulated environment of the Web, Giles says, “people claiming to be what they're not is an ongoing problem.”

An even more basic problem, however, is that crawlers, by their nature, depend on the links between sites to get from one place to another. Sometimes, Giles says, “links don't exist.” And if links don't exist, sites that might be relevant don't get found. “It's the old problem of ‘you can't get there from here.’”

Most casual Web users have been willing to accept these limitations. As Giles puts it, “The Web has both increased and lowered expectations. We're willing to say it doesn't have to be perfect, just give me something close, but give it to me now. Search engines have taken advantage of that.”

As the Web continues to grow, however, and to be more and more important for commerce, communication, and research, information-retrieval problems become a more serious handicap. The percentage of Web content that shows up on search engines continues to dwindle. And as search engines struggle to add more and more content, the information they provide may be increasingly out-of-date.

“The search tools we're currently using are really in the primitive stage,” Giles says. One overdue step in their evolution, he avers, is the emerging trend toward personalization.

“Right now it's pretty much one size fits all. But why should one search tool be appropriate for all of us? We don't all drive the same car, or wear the same clothes, or have the same interests.”

Some day, Giles suggests, search engines may be completely individualized. Users will design their own directories, based on their own personal needs and interests. “Already, you can download the software to build your own small engine. I have my students do this.”

A more practical solution, at least in the short term, is what Giles calls the niche search engine, designed specifically to meet the needs of a group of people with similar interests: employees of a company, say, or members of a profession. By limiting its crawling to a specific subject area, the niche engine can burrow deeper, providing more consistently useful information. A prime example is CiteSeer, a tool that Giles and Steve Lawrence created for the field of computer and information science.

CiteSeer crawls the growing body of computer-science literature available on the Web and ignores everything else. Because the amount of information it finds relevant is relatively small, it can offer users important features that generic engines can't.

In addition to allowing for keyword search, for example, CiteSeer indexes all its documents by citation. It even provides the context of each citation for easy reference, as well as links to citing documents, authors, and institutions. “It can help users see how important a given article has been within the field, and show the relationships between ideas,” Giles says. CiteSeer also allows users to submit links and content updates, making it more current and accurate than generic engines. It can do all these things automatically, Giles says, because its searches are strictly limited—to one subject area, but also to a single, standardized type of document: the scientific paper.
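As a rough illustration of the citation-context idea (this is not CiteSeer's actual code), the following sketch pulls out the text surrounding each numeric citation marker in a plain-text paper. Real papers use many citation styles, which is a large part of what makes the real system hard.

```python
# A sketch of citation-context extraction for plain-text papers that use
# numeric markers like "[3]". The example text is invented.
import re

def citation_contexts(paper_text, window=60):
    """Map each citation marker to the snippets of text surrounding it."""
    contexts = {}
    for match in re.finditer(r"\[(\d+)\]", paper_text):
        start = max(0, match.start() - window)
        end = min(len(paper_text), match.end() + window)
        snippet = " ".join(paper_text[start:end].split())
        contexts.setdefault(match.group(1), []).append(snippet)
    return contexts

sample = ("Earlier estimates of Web size relied on engine self-reporting [1]. "
          "Capture-recapture methods [2] give an independent lower bound [2].")
for reference, snippets in citation_contexts(sample).items():
    print(reference, snippets)
# Reference "2" shows up twice, each time with its surrounding words: the raw
# material for showing where and how a given paper has been cited.
```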

Within its specialized realm, CiteSeer has proved itself tremendously useful. The engine now catalogs some 500,000 papers, and adds 10,000 more every month, Giles says. It receives over 100,000 visits per day. As another measure of its perceived value, a significant portion of the papers it indexes now are not found and retrieved by its crawler but are submitted by its users.

CiteSeer's success has inspired Giles to build similar tools for other “domains.” Early this year, he unveiled eBizSearch, a niche engine for practitioners and students of e-business, built on the same software platform that powers CiteSeer. “The same thing can be done for biotechnology, or physics, or any other clearly defined subject area.”

The ultimate goal, Giles says, is to create search engines that incorporate artificial intelligence.

“CiteSeer has a little bit of smarts in it,” he concedes. Enough to find relevant scientific papers and then parse them for citations and context. But not enough to tailor its responses to an individual user.

“To do intelligent information retrieval, the computer has to understand what I've said,” Giles says. That's called natural language processing. “It might also have to understand what I need, even when I don't say it. And that requires some knowledge of the user.”

Natural language processing is already possible within tightly circumscribed domains. “It works well in something like a call-answering center. You use keyword recognition—all you need is simple understanding. But that's an area of specific, pre-conceived knowledge. And even so, you usually have to have a default human, because those systems break down pretty easily.”
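A toy version of that kind of keyword recognition might look like the sketch below. The departments and keywords are invented, and the fallback at the end is the “default human” Giles mentions.

```python
# Keyword-based call routing, with invented departments and keywords.
ROUTES = {
    "billing": {"bill", "charge", "payment", "invoice"},
    "outage":  {"outage", "no service", "not working", "down"},
    "upgrade": {"upgrade", "faster", "new plan"},
}

def route_call(utterance):
    """Pick a department by simple keyword matching, else hand off to a person."""
    text = utterance.lower()
    for department, keywords in ROUTES.items():
        if any(keyword in text for keyword in keywords):
            return department
    return "human operator"        # brittle cases fall through to a person

print(route_call("There is a strange charge on my bill"))    # -> billing
print(route_call("My cat chewed through the phone cord"))    # -> human operator
```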

As for search engines learning the needs of specific users, it's not that the information isn't accessible, Giles says. “That's a big misconception. Everything you do is known by the search engine, unless you go through an anonymizer (a program that scrambles user identity). A search engine can set cookies, keep logs … When a kid types in a request that says, ‘I need information on velociraptors for my 7th grade report which is due next Tuesday,’ there's an awful lot of useful information embedded there.”

Privacy issues aside (“I think these issues can be resolved within certain trusted domains,” Giles says), on a technical level, researchers haven't yet figured out how to turn that information into intelligence.

Data mining is one basic approach. “This is a term out of database computing, where you're primarily interested in extracting interesting associations from large unstructured databases,” Giles explains. “Phone companies were the first to get into this. They were noticing that people would open up an account, not pay their bills, get shut down, then two months later they would open up another account under a slightly different name.

“Well, although you may change your name, you still call the same numbers. So the answer for the phone company was to build up a pattern of use, and identify non-payers that way. That's data mining.”
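In the spirit of that example, a toy sketch of the idea might compare the dialing pattern of a new account with that of a closed, non-paying one. The phone numbers and the threshold below are invented.

```python
# Compare calling patterns of two accounts; flag a likely repeat non-payer.
def overlap(calls_a, calls_b):
    """Jaccard similarity between two sets of dialed numbers."""
    return len(calls_a & calls_b) / len(calls_a | calls_b)

closed_account = {"814-555-0101", "212-555-0144", "412-555-0188", "717-555-0102"}
new_account    = {"814-555-0101", "212-555-0144", "412-555-0188", "305-555-0123"}

if overlap(closed_account, new_account) > 0.5:       # arbitrary threshold
    print("Flag for review: calling pattern matches a closed, non-paying account.")
```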

The next step toward intelligent search, Giles says, is knowledge extraction. “Instead of just mining associations, you're assigning meaning to them. Search engines are doing this in a very simple way when they give you, in response to a keyword, not just a list of relevant sites but a ranking.”

He smiles. “Did you ever try ‘Ask Jeeves’?” (AskJeeves.com is a popular natural-language search engine that parses a user's question into relevant keywords.) “It used to be it would come up with some pretty funny stuff. You could type in ‘What's the best way to deliver a baby?’ and it would say ‘FedEx’ or ‘UPS.’ Now you can actually get obstetrical information.

“This is a result of knowledge extraction. But Ask Jeeves is still not very smart. It's more building a big look-up table based on keywords and word associations. Instead of focusing on ‘deliver,’ it will focus on ‘deliver a baby.’ But the system isn't learning automatically. It's people who are looking at responses to questions and figuring out how to improve them.

“The ideal for something like CiteSeer would be to make it intelligent enough so that it could understand not just words of text, but equations and figures and whole papers. For that you need more than natural language processing, you need semantics. That's because the computer has to be able to get a sense of what a given paper is about, to tell a review paper from an experimental report. It has to be able to look at a paper reporting an experiment and distinguish: What are the new results here? Are they really new?

“Before that can happen,” Giles says, “a lot of fundamental issues remain to be addressed.

“AI started out with knowledge-based structures of intelligence,” he explains. Like today's so-called “expert systems,” these structures consisted of large collections of “if-then” rules pertaining to a given subject area, coupled with inference engines which allowed the system to make deductions based on these rules. “They worked well in very focused domains,” Giles says, “but in the real world they were awfully brittle. And they had to be built by hand.”
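A miniature sketch of that architecture, with invented rules, shows how an inference engine chains if-then rules forward until nothing new follows. It also shows why such systems are brittle: they can conclude nothing outside their hand-written rules.

```python
# A tiny forward-chaining inference engine over hand-written if-then rules.
# The rules are invented for illustration; real expert systems held thousands.
RULES = [
    ({"fever", "cough"},                    "respiratory_infection"),
    ({"respiratory_infection", "wheezing"}, "possible_bronchitis"),
    ({"possible_bronchitis"},               "recommend_chest_exam"),
]

def infer(facts, rules):
    """Keep firing rules whose conditions are met until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)        # the rule "fires"
                changed = True
    return facts

print(infer({"fever", "cough", "wheezing"}, RULES))
# Chains to respiratory_infection, then possible_bronchitis, then the exam,
# but anything outside the rule set simply never enters the picture.
```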

Then came machine learning. “Here, the machine would learn by example. You had to train it. An example would be neural networks—learning systems modeled on what the brain was thought to do. But even though you could train these systems in a given task, it was hard to figure out what they actually did.

“There's a famous example: credit-card fraud detection. You can train a machine to detect one pattern as fraud and another as legitimate, but what is it really doing?

“We don't know, because we don't have a real working theory of intelligence in a computational sense. And I would argue that even our cognitive models are pretty primitive.”
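To put the credit-card example in concrete terms, here is a toy, single-neuron learner in Python, with invented features and data. It learns by example and classifies a new transaction sensibly, yet its entire “explanation” is three learned weights, which is the opacity Giles is pointing to.

```python
# A toy single-neuron "fraud detector," trained by example on invented data.
# Features: [amount in thousands, foreign transaction?, 3 a.m. transaction?]
import math, random

DATA = [([0.2, 0, 0], 0), ([0.1, 0, 0], 0), ([0.3, 1, 0], 0),   # legitimate
        ([4.5, 1, 1], 1), ([3.8, 0, 1], 1), ([5.1, 1, 0], 1)]   # fraud

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
weights = [random.uniform(-0.1, 0.1) for _ in range(3)]
bias = 0.0
for _ in range(2000):                       # "training" = repeated exposure
    for features, label in DATA:
        prediction = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        error = prediction - label
        weights = [w - 0.1 * error * x for w, x in zip(weights, features)]
        bias -= 0.1 * error

test = [4.0, 1, 1]                          # a large, foreign, 3 a.m. charge
print(sigmoid(sum(w * x for w, x in zip(weights, test)) + bias))  # close to 1.0
print(weights)   # the learned "rule": just three numbers, hard to interpret
```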

That lack hasn't prevented the development of many useful applications to fill very narrow niches: In addition to sniffing out credit-card fraud, computers now engage in mortgage evaluation, handwriting analysis, medical diagnosis, and, in a limited way, smart Internet searches.

But before any broader, more sophisticated sort of intelligence can be placed into a machine, Giles says, we humans will have to get a better grasp on just what intelligence is.

C. Lee Giles, Ph.D., is David Reese professor of information sciences and technology and director of the Intelligent Systems Research Laboratory in the School of Information Sciences and Technology, 001 Thomas Building, University Park, PA 16802; 814-865-7884; clg20@psu.edu. He is also a consulting scientist in computer science at NEC Laboratories in Princeton, NJ.

Last Updated May 1, 2003