Automatic tagging and natural language processing

Extracting relevant topics from text is at the heart of Starmind's AI. For this reason, most of the research done by the AI Team at Starmind is focussed on natural language processing, a field that aims to give machines the ability to structure and derive meaning from human languages. This article gives a high-level introduction to the relevant natural language processing algorithms and their role in the Starmind product. Additional details can be found in the blog post about AI Research at Starmind.

Supported languages

Starmind's natural language processing algorithms can currently process text in the following languages:

  • Chinese (traditional and simplified)
  • Croatian
  • Danish
  • Dutch
  • English
  • French
  • German
  • Greek
  • Hungarian
  • Italian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Ukrainian

We are continuously working on supporting more languages. Czech, Finnish and Norwegian are already planned for 2022. If your organization is interested in a language that is not yet supported, please let us know through your Customer Success Manager, and we will look into adding it to our roadmap.

Tagging

Starmind represents the topics of a text (e.g. a question, an answer, or a document sent to Starmind by a connector) with tags. Tags are words (like "algorithm") or short phrases (like "artificial intelligence") that represent a specific skill or topic of expertise.

Automatic tagging

Starmind automatically identifies which tags are relevant for a given question, answer, or any other document. Starmind can also automatically create new tags if a question about a new topic comes up.

Behind the scenes, Starmind uses natural language processing algorithms such as part-of-speech tagging to identify nouns, adjectives and other relevant grammatical patterns, and to ignore irrelevant words (so-called "stop words" such as "the" and "and").
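As a rough illustration of the stop-word step only, the sketch below splits a text into runs of consecutive non-stop words. It is not Starmind's implementation: the stop-word list, the tokenization rule and the function name candidate_phrases are assumptions made for this example, and no part-of-speech tagging is performed.

```python
import re

# Hypothetical, deliberately tiny stop-word list for this example only.
# Real systems use much larger lists, one per supported language.
STOP_WORDS = {"the", "and", "a", "an", "of", "in", "for", "to", "is", "how", "do", "i", "my"}


def candidate_phrases(text: str) -> list[str]:
    """Return runs of consecutive non-stop words as candidate phrases."""
    # Lowercase the text and keep alphanumeric tokens (hyphenated words stay whole).
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    phrases, current = [], []
    for token in tokens:
        if token in STOP_WORDS:
            # A stop word closes the phrase that is currently being built.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


print(candidate_phrases("How do I train an algorithm for artificial intelligence?"))
# ['train', 'algorithm', 'artificial intelligence']
```

Note that the verb "train" survives in this sketch. Filtering by grammatical pattern, as part-of-speech tagging does, is what would remove such candidates.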

Starmind's AI comes preloaded with large datasets that provide real-world context about the importance and the meaning of different tags. In particular, Starmind has created its own ontology, a powerful dataset containing structured information about thousands of skills and occupations.

Starmind's tagging algorithms are also flexible enough to deal with company-specific topics, such as internal project names, that cannot be found in the ontology. It can be useful to preload a list of company-specific tags into Starmind. However, this is not a strict requirement, as Starmind's AI will automatically create new tags as soon as a text containing such company-specific topics is processed. The more questions and answers related to these tags are created, the better Starmind will understand their meaning and the relationships between them.
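The idea of creating tags on first use can be pictured with a small sketch. Everything here is hypothetical: known_tags stands in for tags already available from the ontology or a preloaded list, "project falcon" is a made-up internal project name, and register_tags is an illustrative helper, not a Starmind API.

```python
from collections import Counter

# Hypothetical stand-in for tags known from the ontology or a preloaded list.
known_tags = {"algorithm", "artificial intelligence"}

# How often each tag has been seen across processed texts.
usage_counts = Counter()


def register_tags(candidates):
    """Count each candidate tag and return the tags that were newly created."""
    created = []
    for tag in candidates:
        if tag not in known_tags:
            # Unknown topic: create the tag the first time it is seen.
            known_tags.add(tag)
            created.append(tag)
        usage_counts[tag] += 1
    return created


print(register_tags(["algorithm", "project falcon"]))  # ['project falcon']
print(register_tags(["project falcon"]))               # []
print(usage_counts["project falcon"])                  # 2
```

In this sketch a preloaded list would simply mean that known_tags starts out larger. The result for later texts is the same either way.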

Tag management

Starmind offers some options for managing tags manually:

  • Question posers can modify the tags of their own questions.
  • Users with admin rights can delete, rename and/or merge tags in the Admin Area (under “Content management” - “Tags”).

Manual tag management should only be necessary in exceptional cases. If the automatic tagging algorithms regularly create irrelevant or incorrect tags, or if such tags reach a high usage count, please report this to Starmind as a bug.

Embeddings

While tags are visible to the end user in Starmind's user interface (for example in the profile of a user), Starmind also uses additional techniques to understand natural language in an even more fine-grained manner. In particular, Starmind uses neural-network-based embedding methods, whereby words, sentences or even entire documents are represented as mathematical vectors in a high-dimensional vector space. These vectors contain very detailed information about how different words, tags, questions and other documents relate to each other, even across different languages. For example, they can be used to identify words that are synonymous, or to detect questions that are duplicates of each other.
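A common way to compare two embedding vectors is cosine similarity, and the sketch below uses it to flag likely duplicate questions. It is an illustration under stated assumptions, not Starmind's method: the three-dimensional vectors are made up by hand (real embeddings come from a trained model and have hundreds of dimensions), and the threshold of 0.9 is an arbitrary example value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors; a real system would obtain these from an embedding model.
embeddings = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "How can I change my password?": [0.8, 0.2, 0.1],
    "Where is the cafeteria?": [0.0, 0.1, 0.9],
}

DUPLICATE_THRESHOLD = 0.9  # hypothetical cut-off chosen for this example


def find_duplicates(query, vectors, threshold=DUPLICATE_THRESHOLD):
    """Return all other questions whose similarity to the query meets the threshold."""
    query_vector = vectors[query]
    return [
        text
        for text, vector in vectors.items()
        if text != query and cosine_similarity(query_vector, vector) >= threshold
    ]


print(find_duplicates("How do I reset my password?", embeddings))
# ['How can I change my password?']
```

The two password questions point in nearly the same direction (similarity of about 0.98), while the cafeteria question is almost orthogonal to both, so only the former pair is reported.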

The embeddings cannot easily be visualized, but they are easy to use in mathematical computations behind the scenes. Consequently, many state-of-the-art algorithms make use of such embeddings. Embeddings are crucial for making Starmind's expert search and user profiles more accurate and context-aware. They are also used in other components of the Starmind application, for example to provide more relevant search results in the question search.