The Search Engine Watch Blog just pointed me to a post that pointed me to a paper entitled “Large-Scale Named Entity Disambiguation Based on Wikipedia Data” (PDF) by Silviu Cucerzan of Microsoft. Most of the paper is technical and algorithmic Greek to me, but this one sentence makes perfect sense to me (or rather, the first half of the sentence makes sense to me and the rest should have been its own sentence).
The application on a large scale of such an entity extraction and disambiguation system could result in a move from the current space of words to a space of concepts, which enables several paradigm shifts and opens new research directions, which we are currently investigating, from entity-based indexing and searching of document collections to personalized views of the Web through entity-based user bookmarks. (page 9, my emphasis)
One of the main gripes about web search has been that it can’t benefit (at least not very well) from human knowledge of the relationships between words. It matches letters in a row, while directories gather pages based on concepts. But then directories got out of hand, people couldn’t keep up with them, and they all turned into search engines no matter how much they still tried to look like directories… Well, if this thing gets off the ground, we could have the best of both worlds: the scale of search with the power of directories.
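To make the “space of concepts” idea a bit more concrete, here’s a toy sketch (not the paper’s actual algorithm) of what entity disambiguation roughly amounts to: a mention like “jaguar” is resolved to an entity by comparing the words around it with context words associated with each candidate entity, the kind of data the paper derives from Wikipedia. The entity names and context sets below are made-up illustrations.

```python
# Toy sketch of entity disambiguation by context overlap.
# All candidate entities and their context-word sets are invented
# examples, not real data from the paper.

CANDIDATES = {
    "jaguar": {
        "Jaguar (animal)": {"cat", "prey", "jungle", "species"},
        "Jaguar Cars":     {"car", "engine", "luxury", "model"},
    }
}

def disambiguate(mention, context_words):
    """Pick the candidate entity whose context words
    overlap most with the words around the mention."""
    scores = {
        entity: len(ctx & set(context_words))
        for entity, ctx in CANDIDATES[mention].items()
    }
    return max(scores, key=scores.get)

print(disambiguate("jaguar", ["the", "engine", "of", "the", "new", "model"]))
# Jaguar Cars
```

Once every mention resolves to an entity like this, an index can be keyed by entities rather than strings, which is what would let search operate over concepts instead of letters in a row.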
Another source of this conceptual linking that’s so powerful and so difficult to teach to computers may be the Google Book Project. Think of all of those indexes to all of those books. Surely we could harness the power of generations of indexers to map concepts. How hard could it be? ;)