TopicZoom uses a large semantic network to recognize topics in texts. The nodes of this network represent “concepts” such as topics, locations, time periods, persons, enterprises, organizations, and events. Each node has a preferred standard name, which is always used to display the concept. In addition, language variants of the concept are stored with its node. These variants are used during text analysis to recognize all mentions of the concept in an input document. Mapping the variants of a concept name to a single preferred name normalizes and standardizes language expressions. Many concept names in the TopicZoom network are multi-word expressions such as “New York State Opera”, “Angela Merkel”, or “French wine”. In this way, TopicZoom semantic text analysis does not rely on single words, but on real concept names.
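
The mapping from language variants to a preferred name can be illustrated with a small sketch. The variant table, the function name `find_concepts`, and the longest-match strategy are assumptions made for illustration only; they are not taken from the TopicZoom implementation.

```python
import re

# Hypothetical variant table: lowercased surface form -> preferred name.
VARIANTS = {
    "angela merkel": "Angela Merkel",
    "chancellor merkel": "Angela Merkel",
    "new york state opera": "New York State Opera",
    "french wine": "French wine",
    "wines from france": "French wine",
}

# Longest variant, measured in words, bounds the phrase window.
MAX_WORDS = max(len(variant.split()) for variant in VARIANTS)


def find_concepts(text):
    """Return the preferred names of all concepts mentioned in text, in order."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first so multi-word names win.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in VARIANTS:
                hits.append(VARIANTS[phrase])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no variant starts at this token
    return hits


print(find_concepts("Chancellor Merkel praised wines from France."))
```

Both “Chancellor Merkel” and “Angela Merkel” resolve to the same preferred name here, which is the normalization effect described above.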

The TopicZoom semantic net is organized as a hierarchical graph. The topmost nodes represent general fields, such as politics, sports, or economy. The children of a given node point to its major subfields. In our mental representation of the world, spatial notions such as “fields” and “subfields” help to organize the relationship between general thematic areas and specific topics. TopicZoom uses the same ordering principle not only for thematic fields, but also for geographic areas and temporal periods. Following the links of the net downward leads from general fields to more and more specific fields, and from large regions or periods to small subregions or subperiods. Looking upward from a given node (“Economic policy”), we typically find several parent nodes (“Economy”, “Politics”).
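
Because a node can have several parents, the hierarchy is a directed graph rather than a strict tree. A minimal sketch, assuming a simple dictionary of parent links (the node “Monetary policy” and the helper `ancestors` are hypothetical and only serve the illustration):

```python
# Hypothetical fragment of the net: node -> list of its parent nodes.
PARENTS = {
    "Politics": [],
    "Economy": [],
    "Economic policy": ["Economy", "Politics"],
    "Monetary policy": ["Economic policy"],
}


def ancestors(node, parents=PARENTS):
    """Collect every more general node reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("Monetary policy")))
```

Tracking visited nodes in a set ensures that a general field reached along two different paths is reported only once.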

During text analysis, if a hit is found for a concept (a node), all more general fields also receive a score. For example, if “President Obama” is found in the text, then “U.S. politics”, and in turn “United States of America” and “Politics”, also obtain a higher count. For the thematic profile of an input document, the scores from all hits found in the text are accumulated. In this way, a topic is recognized even if only its subtopics are mentioned in the text. In the final TopicZoom ranking mechanism, a second scoring factor is added which ensures that general fields are not preferred and that the “eye-catching” topics of the text receive the best ranking. For a given input text, the scores computed for the topics are semantic taggings which represent ideal subject metadata. These subject metadata can be used not only for simple text classification tasks, but also provide an ideal basis for precise thematic and subject-oriented search over a complete collection, far beyond conventional keyword matching.
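
The accumulation step can be sketched as follows. The graph fragment, the unit score per hit, and the function names are assumptions for illustration; the second scoring factor is not specified in the text, so the sketch stops at the raw accumulated counts.

```python
from collections import Counter

# Hypothetical fragment of the net: node -> list of its parent nodes.
PARENTS = {
    "Politics": [],
    "United States of America": [],
    "U.S. politics": ["Politics", "United States of America"],
    "President Obama": ["U.S. politics"],
    "U.S. Congress": ["U.S. politics"],
}


def ancestors(node):
    """Return all more general nodes above the given node."""
    seen = set()
    stack = list(PARENTS.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def accumulate(hits):
    """Add one point per hit to the concept itself and to each of its ancestors."""
    scores = Counter()
    for concept in hits:
        scores[concept] += 1
        for general in ancestors(concept):
            scores[general] += 1
    return scores


profile = accumulate(["President Obama", "President Obama", "U.S. Congress"])
for topic, score in profile.most_common():
    print(topic, score)
```

In this toy run, “U.S. politics” reaches a count of 3 although it never appears in the hit list itself. The general fields “Politics” and “United States of America” score just as high, which is the bias toward general fields that a second ranking factor would need to correct.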
