Teaching a Computer to Read

David Pacchioli
December 01, 2004

Dan Heinze once spent a year on a Navajo reservation in Arizona, scrutinizing the language and culture of a civilization as old as the desert. A master of divinity, he has served as acting dean of a Presbyterian seminary. He taught Greek and Hebrew and history.

"Then," he says matter-of-factly, "came the arrival of my second child," and with it the realization that teaching "was a method of slow starvation."

Heinze did what any self-respecting underpaid theologian-linguist would do. He shifted gears. He went back to school. He crammed 40 credit-hours of math into one pop-eyed year, landed a job as a computer programmer, got a master's degree in computer science, took another job as a research engineer with a defense contractor, and then completed a Ph.D. in industrial and management systems engineering at Penn State.

Now the mild-mannered, boyish, early-to-rise Heinze, who drives a vintage convertible and wears his hair in a style reminiscent of the early Beatles, has brought his linguistics training and his computer-science training—and who knows, maybe his training in theology, too—to bear on a devilishly complex problem that's been a particular thorn for a certain segment of the scientific community since Alan Turing and the post-war dawn of the so-called thinking machine.

He is teaching a computer how to read.

Communicating with machines in human, or natural, language has been a dream of computer programmers since the late 1940s. In those Cold-War days, the first thought was that so-called natural language processing would be useful for speeding translation of military and scientific documents. Machine translation became the watchword in the brand-new artificial-intelligence community.

"There was a lot of enthusiasm for it," Heinze says.

It seemed a reachable goal. All programmers had to do was figure out the rules that govern language and then plug them into the computer.

At the time, this seemed a reasonably straightforward task. Linguistics was dominated by an approach known as formalism, most prominently associated with the theorist Noam Chomsky. Formalism espoused that language could be reduced to a finite set of well-defined rules from which all utterances flowed. Language was a closed system, of mathematical precision. So clear-cut were its rules of order that a given statement could be effectively "understood" simply by reference to its syntax—the way it was put together—without regard for its semantic content, or meaning.

Chomsky's formalist approach turned out to be exceedingly important to the brand-new science of programming. Its unwavering precision provided a strong theoretical backbone for the development of higher-level computer languages like FORTRAN, COBOL, and LISP. With natural languages, however, it was something of a giga-flop.

"The artificial languages are nicely defined," Heinze explains. "There's no ambiguity. But humans don't speak in such formal terms. In our languages, ambiguity is inherent."

Oh, formalism was tried, all right, and with some success. Early efforts at automatic translation could mimic sentence structures pretty well, and the best of them could usually muster a fair degree of accuracy in putting across the sense of a message. But literal accuracy, when it came to human language, turned out to be not nearly enough. Ambiguity raised its head. To put it bluntly, something was lost in the translation.

There's a well-known anecdote, probably apocryphal, that illustrates some of the frustration of these early forays. It goes something like this:

The English phrase "The spirit is willing but the flesh is weak" was dropped into a natural-language processor and translated into Russian, then translated back into English. The sentence returned promptly, grammatically intact. Still, there was something ineffably, well, different about the twice-processed message. "The vodka is good," the new sentence read, "but the meat is rotten."

By the late '60s, the U.S. government, which had underwritten most of the work in natural-language processing in this country, lost patience with the progress researchers were making in the area. A report issued by the National Academy of Sciences recommended curtailment of funding.

Still, research into artificial intelligence went on, wounded but not dead. Important advances were made in the areas of knowledge representation and linguistic theory.

Around this time, also, cognitive psychology began to be ascendant. In contrast to behaviorism, which viewed the human brain as a black box whose inner workings could never be fathomed, cognitivism, as Heinze explains it, "was based on the premise that it is possible, by scientific investigation, to come to some understanding of how the brain works."

For computational linguists, this meant an entirely new approach to the problem of natural-language processing. Where the behaviorist could only look at output—language utterances—and try to discern some abstract pattern in them that might be duplicated on computer, the cognitivist hoped to find out how the brain functions in regard to language, and, having done so, to try to simulate that mechanism in a machine.

This was easier said than done, of course. What it meant was, first of all, arriving at a new theory of language. The so-called functionalist approach holds that language is an open system, not bounded by a finite set of rules but unbounded, ever-changing, always adapting in response to its environment.

So what happens when language meets brain?

Hearing speech, we perceive an acoustic waveform, analyze it as sounds, put the sounds together as words, and connect the words into sentences, to which we assign meaning. Then we interpret this meaning according to a context: the current situation, previous discourse, awareness of the speaker's intentions.

Each of these steps requires a different scope of knowledge. Once we get to the level of semantics, or meaning, that knowledge goes far beyond mastery of linguistic rules. As Roger Schank, a leading AI researcher, puts it, "Understanding a sentence involves all the knowledge we have so far acquired about what goes on in the world."

But set aside knowledge requirements for a moment and let's complicate things a little by looking again at the process.

Those steps: do they really happen independently, one after the other? The formalists act as if they do. Chomsky writes of the autonomy of syntax. He has spent most of his career (his linguistics career, that is) unearthing the rules that govern the structures of language. For the last twenty years, in fact, he and his students have been looking for the rules that compose what he calls the Universal Grammar, or UG: the set of properties shared by all human languages.

Functionalists, on the other hand, stress that the levels of language interact. They focus particularly on the way the brain seems to use semantic knowledge to efficiently constrain syntactic processing. Michael McTear, author of The Articulate Computer, published in England in 1987, uses a set of sentences to illustrate this kind of narrowing.

  • John hit the boy with the cricket bat.
  • John hit the boy with the red hair.
  • John hit the ball with the cricket bat.

Without a context, McTear asserts, the first of these sentences is ambiguous: a reader doesn't know whether John used a cricket bat to hit the boy or whether he hit the boy who was holding a cricket bat. But the second and third sentences, although similar in structure to the first, are not ambiguous. In each case, our semantic knowledge selects the plausible meaning.

Modeling that incorporates this kind of interaction, functionalists say, is crucial to effective natural-language processing. As McTear points out, a sentence containing seven words each of which had three different meanings would give rise to 2,187 different readings. A like sentence of 14 words would have 4,782,969 possible versions. "Obviously," he concludes, "it would be inefficient to have to produce all these readings for subsequent analysis . . . if most of them could be discarded earlier using higher-level knowledge."
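McTear's arithmetic is simple exponential growth: each ambiguous word multiplies the number of possible readings by its number of senses. A quick sketch (the function name here is ours, not McTear's):

```python
# Each ambiguous word multiplies the number of possible readings
# by its number of senses, so readings grow as senses ** words.
def reading_count(num_words: int, senses_per_word: int) -> int:
    return senses_per_word ** num_words

print(reading_count(7, 3))   # 2187 readings for the 7-word sentence
print(reading_count(14, 3))  # 4782969 for the 14-word case
```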

Until recently, however, functionalist attempts at language modeling have been stymied by a couple of realities. First of all, there has been no unified, overarching functionalist theory of language. For another thing, the functional approach is much harder to computerize.

The first of these problems was solved by Ronald Langacker, a professor at the University of California, San Diego. In the late '70s, Langacker, a linguist who had grown frustrated with formalism, began work on the first real systematic approach to language from a functional, or cognitive, perspective—an approach that has since come to be known as cognitive linguistics.

Heinze came into natural language processing through the back door. He was working for HRB Systems, a central-Pennsylvania defense firm, on problems in signal processing. "I had not really consciously aimed at combining computer science with my linguistics training," he remembers. "But one of our clients was the National Security Agency. They of course have huge archives, and they wanted a more efficient way to locate documents of interest." A key-word system was not sophisticated enough, because it could not differentiate between an incidental use of the search word and a case where the search word was the real topic of the document. What they needed, essentially, was a system that could understand.

Heinze set to work. A four-year research program yielded a workable document-routing system, one that would satisfy the NSA's needs. Heinze submitted a proposal for implementation of the system in September 1994.

Working on the project convinced Heinze that he wanted to go further in computational linguistics. He spoke with Soundar Kumara, Penn State associate professor of industrial and management systems engineering and computer science and engineering, who directs the University's Intelligent Design and Diagnostics Lab, and Kumara arranged a program whereby Heinze would work out a natural-language processing system as a Ph.D. project. It would fit nicely into the artificial-intelligence activities of Kumara's lab.

Internal research funding from HRB Systems allowed Kumara and Heinze to bring Dan Davenport onto the project. Davenport, a Ph.D. in mathematics who had studied formal linguistics, had the expertise to program the complex data structures and algorithms a natural-language processor would need.

Heinze began investigating language theory, trying to get the necessary understanding to proceed with building a system. He learned about the shortcomings of formalist approaches to natural-language processing, and noted the promise, and the incompleteness, of functionalist approaches. Then he came across Langacker's work: a unified and comprehensive theory that claimed to mimic the way the brain processes language.

"As soon as I saw it, it piqued my interest," Heinze says. The immediate question, though, was whether Langacker's theory could be adapted for computer.

"Langacker is not a mathematician," says Heinze, "and he has spoken out against the dominance of math in linguistics. He consciously avoided stating things in mathematical terms."

Together, Heinze and Davenport sat down and studied Langacker's opus, the two-volume Foundations of Cognitive Grammar. "In the mornings we read," Heinze recalls. "In the afternoons we would get together and talk about it."

"Like swimming through peanut butter," Davenport says now, hoisting one of two thick volumes.

Yet by the time a few months were up, they had figured out what Langacker was talking about. The question remained: Could they build a computer system that would accurately capture Langacker's ideas?

The crux of the matter, as Heinze saw it, would be to figure out an effective system for representing and manipulating knowledge—basic concepts or data structures and their relationships—in this model brain. "Devising the correct data structure is the key to successfully automating any intelligent behavior," he says.

Take the concept "dog." A realistic knowledge-representation scheme, according to Paul Thagard, a theorist Heinze quotes in his dissertation, must at the least be able to do several basic things with this concept. It should enable our computer to recognize a dog when it sees (or reads about) one. It should help it remember things about dogs—both about dogs in general and about particular incidents involving dogs. It should enable it to make inferences about dogs from what it already knows, learn new facts about dogs from additional examples, reason about objects that are similar to dogs, and generate explanations for a given dog's behavior. All this in addition to being able to understand the word dog when it pops up in text, and to respond with it when appropriate.

"At a theoretical level," Heinze writes, "Langacker's explanation of cognition seems capable of useful performance against all these roles." But how to make it happen?

Heinze looked at many standard methodologies for organizing knowledge, including "is-a" hierarchies, tree-like structures of categories and sub-categories. (A house is a building. A parking deck is a building.)

Then one day Heinze had a revelation. "I walked into Dan's office," he remembers, "and I said: 'You know, we're going to have to think about things very differently.' "

He was thinking about the ways knowledge is standardly categorized in AI. With is-a methodologies, he explains, the starting point for a given concept—the root or "entity"—is nothingness. "It's expressed as the null set. There's nothing you can say about it. It has no characteristics." The way you build meaning into a concept is by adding semantic "markers": A mammal is an animal, it has fur, it is warm-blooded, etc. You build categories.

Then, when you encounter something new, you search your existing knowledge by matching markers. As Davenport puts it: "It's warm-blooded, it has fur, it has four legs and a tail that wags, it barks—it must be a dog!"

"But," says Heinze, "when you categorize things this way, there will be exceptions. "What do you do with a platypus—a mammal that lays eggs?" (According to your hierarchy a mammal is an animal that bears its young live.) Or again, as Davenport says, "What happens when you run into a three-legged dog?"

In such a situation, says Heinze, "You can only go two ways. Either you have to start disinheriting your exceptions—which means the system is nonmonotonic, which means you don't compute consistently, which leads to problems with searching—or you enumerate—every possible thing is its own class." In either case, you lose.
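The dead end Heinze describes can be made concrete with a toy is-a hierarchy. The classes and markers below are invented for illustration; the `platypus` entry shows the nonmonotonic "disinheriting" he objects to, where a child must cancel a marker it inherited:

```python
# A toy "is-a" hierarchy with semantic markers. The platypus entry
# forces a nonmonotonic override of an inherited marker.
hierarchy = {
    "animal":   {"parent": None,     "markers": {"alive": True}},
    "mammal":   {"parent": "animal", "markers": {"fur": True, "live_birth": True}},
    "dog":      {"parent": "mammal", "markers": {"barks": True, "legs": 4}},
    "platypus": {"parent": "mammal", "markers": {"live_birth": False}},
}

def markers(concept):
    """Collect inherited markers, letting children override ancestors."""
    chain = []
    while concept is not None:
        chain.append(hierarchy[concept]["markers"])
        concept = hierarchy[concept]["parent"]
    merged = {}
    for level in reversed(chain):  # root first, so children win
        merged.update(level)
    return merged

print(markers("dog")["live_birth"])       # True, inherited from mammal
print(markers("platypus")["live_birth"])  # False: the exception "disinherits"
```

Both escape routes Heinze names are visible here: either the exception overrides what it inherited (nonmonotonicity), or you give up inheritance and make every oddity its own class.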

But what if, Heinze wondered that day, we turn the standard hierarchy on its head?

"Instead of starting with a root that contains nothing, we start with a root, we call it 'entity,' that has everything. Anything is possible. Entity includes all possible meaning."

The upshot of this brainstorm was L-space ("L" stands for lattice), a complex of data structures that functions more like a web than a ladder.

In L-space, Heinze explains, to create information you start from the bottom, or entity, which he describes as a flat line, and you raise certain semantic areas—certain bits of meaning—to salience, a level of relative importance. "If we raise the levels of areas prototypical of mammals, we get a profile that we recognize as a mammal. Modify those characteristics further and we get a dog." Like the human brain, as Langacker envisions it, L-space keeps adding qualifying words or concepts to reduce ambiguity, creating a profile or scene.

Crucially, in L-space, the concept of "mammal," for example, includes many characteristics we wouldn't ordinarily associate with a mammal, things like scaliness and ability to fly. What makes it a "mammal" is that some of its characteristics are more salient than others.

"For example," Heinze says, "could a mammal lay an egg? It's not very likely. A bird is going to have a much higher salience for the category of 'lays eggs.' But how about the platypus? In this system, there's room for it.

"What it enables us to do is very quickly perform a search on an entire knowledge space—leaving nothing out—and come to the solution which best fits the problem.

"It matches the way your mind works—or so we would like to think."

For his Ph.D. thesis, Heinze, under Kumara's direction, developed the data structures and algorithms that would bring L-space to life. Then he, Kumara, Davenport, and a team of graduate students from Kumara's lab proceeded to build around it, creating what Heinze calls the first end-to-end natural language processing system based on a cognitivist theory of language. They named the creature Computational Cognitive Linguistics, or CCL.

Kumara is a slight, quiet man who has worked for ten years in various forefronts of AI, from expert systems to parallel processing, always with an eye toward industrial applications. Lately his work in computational linguistics has led him into an exploration of Sanskrit grammar. The walls of his office, however, reveal a more purely scientific bent: they are plastered with large posters of Albert Einstein. "This is what we wanted a computer to do," he explains. "First recognize, then understand, then make inferences.

"To read and make sense," he adds, musing, "so easy for you and me, is an extremely difficult problem to automate."

Computerized text processing, Kumara calls what CCL does. He offers a brief descriptive tour.

An incoming natural-language text arrives first at the system's preprocessor. "It's messy. You clean it up for clarity. Standardize its form. Get rid of garbage."

Next, "you look closely at the body of the message, try to recognize the category of its content." Is it a topic of current interest, or not important? "You evaluate using rules created from prior observation."

The next step, then, is to try to understand. This is a process of narrowing down, or L-space translation. The message is mapped into concepts, each of which is taken through L-space, first individually, then in various combinations. L-space tests combinations, and constructs meaning in a form it understands. "In essence," says Davenport, "this is something that translates English sentences into database language."
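The three stages Kumara and Davenport describe might be sketched schematically like this. Every function body is a stand-in, since the article says nothing about how CCL actually implements them:

```python
import re

def preprocess(raw: str) -> str:
    """Clean up the message: standardize form, strip garbage characters."""
    return re.sub(r"[^A-Za-z0-9 .,]", "", raw).strip().lower()

def categorize(text: str, topic_rules) -> bool:
    """Recognize the category of the content, using rules from prior observation."""
    return any(rule in text for rule in topic_rules)

def understand(text: str) -> dict:
    """Map the message into concepts in database form (stand-in for L-space)."""
    words = text.split()
    return {"subject": words[0], "predicate": words[1], "object": " ".join(words[2:])}

msg = "  Unit-7 reached ##checkpoint alpha!!  "
clean = preprocess(msg)
if categorize(clean, topic_rules={"checkpoint", "deployment"}):
    print(understand(clean))
```

The last step is the one Davenport summarizes as translating English sentences into database language: free text goes in, structured fields come out, ready to update a planning database.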

Kumara and Heinze first intended to apply CCL to the reading of industrial maintenance manuals. In large facilities, Kumara explains, where documentation on complex machinery is sometimes measured not by the page but by the ton, keeping up with the flow of information can be an overwhelming task.

In 1991, however, while they were working on the industrial system, Kumara's lab hosted a visit by Col. William Crowder, the officer in charge of the Army Logistics Office during the Persian Gulf war.

"During Desert Storm," Heinze explains, "the Logistics Office found itself overwhelmed with Telecom messages—what we would call e-mail—reporting on the status and deployment of troops and equipment." Crowder was looking for a better way to manage this heavy flow of information, to make sure there were no lags in the constant updating crucial to operational decisionmaking. In CCL, he saw a system he thought could be effectively adapted to the purpose.

At first glance, Kumara wasn't so sure. Studying stacks of e-mail printouts of the military's version of English, he realized that yet another level of translation would be in order. "I told Crowder I couldn't promise anything," he says now. But the lab set to work on the problem.

"The object," Heinze says, "was to create a system which—in real-time, or near it—would examine messages coming on the network, extract pertinent information about subjects of interest—the deployment of a certain unit, the status of a ship—and use that information to update planning databases as quickly as possible."

He adds: "It turned into a fairly large project."

By May of 1994, after two years of work by a team that ranged between eight and ten people, Heinze, Kumara, et al., were ready to unveil a prototype of their system before an audience of Pentagon officers. To the delight of everyone present, CCL passed the Army's proof-of-concept test, successfully recognizing, parsing, and understanding a long and convoluted sentence having to do with a delivery of some C-141 airplanes.

"Their AI people were really impressed," Davenport reports. That first-step success has led to continued funding from the Army, support that will allow the Penn State-HRB team to flesh out the CCL system. And it has led to other opportunities.

"There are many other applications for this kind of text processing," Heinze says. "Government documents, medical records, the futures market—anywhere where information is time-critical. The more quickly we're able to generate text, the greater the need for tools to digest it."

International business developments, too, such as new EEC regulations for multi-lingual documentation, should spur new work toward the old goal of automatic translation.

"The more electronic means for communication we develop," Heinze says, "the more valuable natural-language processing becomes."

Daniel T. Heinze received his Ph.D. in industrial engineering in May 1994. He is a principal engineer at HRB Systems, Inc., in State College, Pennsylvania. Soundar R.T. Kumara, Ph.D., is associate professor of industrial engineering and computer science and engineering, and director of the Intelligent Design and Diagnostics Research Laboratory, 207 Hammond Building, University Park, PA 16802; 814-863-2359. Daniel M. Davenport, Ph.D., is a senior engineer at HRB Systems, Inc.

Graduate students who have worked on the project reported are Karthik Chittayil, who received a Ph.D. in industrial engineering in August 1994; Ching-Yao Kao, Taioun Kim, Jinwhan Lee, and Dongmok Sheen, all Ph.D. students in industrial engineering; and Tim Thomas, a master's student in computer science.

Research funding was provided by HRB Systems, Inc., and by the Office of the Deputy Chief of Staff for Logistics, United States Army.
