The Semantic Web, Semantics and Uncertainty in Languages
A lot has been said about the Semantic Web and how it could change the face of the web. Indeed, many well-established, well-credentialed computer scientists have put a great deal of effort into turning the Semantic Web into a reality. Surely a nobody like me has no right to doubt these people, some of whom I have looked up to for a long time. Yet I am very sure that the Semantic Web, in its current form, will never become a reality.
There are reasons why I can say this with such conviction. The main reason the Semantic Web will fail is that it is designed by computer scientists who are most at home with logic, discrete mathematics and symbolic processing. These are people who believe the world can be represented in purely logical terms. At the 2004 WWW Conference, Sir Tim Berners-Lee stated as much when he said he believes the Semantic Web does not need to represent uncertainty.
In simple language, computer scientists tend to define the world in very rigid terms. For example, a car might be defined as a powered machine with four wheels that can roll around. Under this definition, a car can never have three wheels. At this point, computer scientists would either tweak their rigid definition of a car to fit three-wheeled cars, or label the three-wheeled car a "tri-car" and classify it as a cousin of the car under a common ancestor, "vehicle". And as more weird cars get designed and produced, computer scientists would have to keep revising the rigid definition of what a car is.
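To make this concrete, here is a minimal sketch (in Python, with hypothetical names of my own, not anyone's actual ontology) of the kind of crisp, all-or-nothing definition I am talking about:

# A deliberately rigid, logic-style definition of "car":
# powered, exactly four wheels. There is no notion of "probably a car".
class Vehicle:
    def __init__(self, wheels, powered):
        self.wheels = wheels
        self.powered = powered

def is_car(vehicle):
    # Crisp membership test: either it is a car or it is not.
    return vehicle.powered and vehicle.wheels == 4

three_wheeler = Vehicle(wheels=3, powered=True)
print(is_car(three_wheeler))  # False -- the definition has to be patched

Every new oddball vehicle forces another patch to is_car, which is exactly the treadmill described above.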
Is this the only way to define stuff? Of course not.
Natural language processing researchers have been facing this exact problem since the late 1950s, when Professor Noam Chomsky did his pioneering work on the language hierarchy and, specifically, context-free grammars. In layman's terms, the idea is that every human sentence has structure, and this structure can be discovered by applying a set of fixed grammar production rules. For twenty years, researchers tried to discover the ultimate set of production rules for human languages such as English. No one succeeded, because human language is ambiguous by nature.
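To give a deliberately tiny illustration of what such production rules look like, here is a sketch using NLTK's grammar tools; the grammar and sentence are my own toy examples, and a real grammar would need thousands of rules and still fail to cover everything people actually write:

import nltk

# A toy context-free grammar: each production rule says how a
# phrase may be built from smaller pieces.
grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> Det N
  VP -> V NP
  Det -> 'the'
  N  -> 'dog' | 'ball'
  V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse('the dog chased the ball'.split()):
    print(tree)  # prints the single parse tree for this sentence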
The simplest example I can give is the standard linguist's one. The word "bank" has multiple meanings. "Bank" could refer to a financial institution where people deposit money and earn interest. "Bank" could also refer to the sloping land beside a body of water such as a river or lake. In fact, WordNet lists 18 different senses that can be attached to the word "bank". No amount of the logic and discrete structures that computer scientists love so much can accurately capture the meaning of human language. To this day, after nearly half a century of research effort, Word Sense Disambiguation (the task of identifying exactly which sense is being used in a sentence or passage), and particularly what WSD researchers call the "all words" task, is still a very open research topic.
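For the curious, the WordNet senses are easy to inspect; this sketch assumes the NLTK library with its WordNet corpus downloaded:

# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

# WordNet groups senses into "synsets"; "bank" has 18 of them
# (10 noun senses and 8 verb senses), from the financial
# institution to the sloping land beside a river.
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())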
In the 1980s, significant progress was made when natural language researchers gave up on purely logical or discrete systems and started employing statistical techniques. The real-world accuracy of language processing systems using discrete/logical methods hovered around the 30-50% mark. Using statistical methods, accuracy jumped to the 70-90% range, and in some of the simpler research areas, such as speech recognition, near-human performance of 95% or more has been achieved.
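As a rough illustration of what "statistical" means here (a toy naive Bayes sketch of my own, not any particular research system), the idea is to pick the sense of "bank" that is most probable given the surrounding words, rather than writing a rule for every possible sentence:

from collections import Counter

# Toy training data: context words observed near each sense of "bank".
training = {
    'bank/finance': ['money', 'deposit', 'interest', 'loan', 'account'],
    'bank/river':   ['water', 'fish', 'mud', 'shore', 'flood'],
}
counts = {sense: Counter(words) for sense, words in training.items()}

def best_sense(context):
    # Score each sense by how likely the context words are under it,
    # with add-one smoothing so unseen words do not zero out a sense.
    scores = {}
    for sense, c in counts.items():
        total = sum(c.values())
        score = 1.0
        for word in context:
            score *= (c[word] + 1) / (total + len(c))
        scores[sense] = score
    return max(scores, key=scores.get)

print(best_sense(['deposit', 'interest']))  # bank/finance
print(best_sense(['water', 'mud']))         # bank/river

The decision is a matter of probability, not a fixed rule, so a sentence the system has never seen before still gets a sensible answer.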
I do not believe this is the forum to go into detail about why statistical methods yield such a huge improvement over logic-based systems (if you really want a clue, look up the Zipf-Mandelbrot law). It is sufficient to say that I believe all languages (yes, that includes programming and mathematical languages) are inherently ambiguous when it comes to conveying meaning (i.e., semantics). As a result, a Semantic Web that does not incorporate statistics or uncertainty will never succeed in its goal of transparent information and knowledge transfer on the web.
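Since I only name it in passing, here is the rough shape of the Zipf-Mandelbrot law (the parameter values below are merely illustrative): the frequency of a word falls off as a power of its rank, so a handful of words dominates any corpus while most words live in a very sparse long tail, which is exactly where hand-written rules run out of coverage.

def zipf_mandelbrot(rank, s=1.0, q=2.7, c=1.0):
    # Predicted relative frequency of the word at a given rank:
    # frequency ~ c / (rank + q) ** s
    return c / (rank + q) ** s

# The few most frequent words carry a huge share of all text,
# while the long tail of rare words is enormous but sparse.
for rank in (1, 2, 10, 100, 1000):
    print(rank, round(zipf_mandelbrot(rank), 5))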