Wednesday, August 1, 2018

Gensim – Vectorizing Text and Transformations

In this article, we will discuss vector spaces and the open source Python package Gensim. We will only scratch the surface of Gensim's capabilities. The article introduces vectors, the data structure most widely used in text analysis that involves machine learning techniques.

Introducing Gensim

When we talk about representations and transformations in this article, we mean the different ways of representing strings as vectors, such as bag-of-words, TF-IDF (term frequency-inverse document frequency), LSI (latent semantic indexing), and the more recently popular word2vec. The transformed vectors can then be plugged directly into scikit-learn machine learning methods.

Gensim started off as a modest project by Radim Rehurek and was largely the subject of his Ph.D. thesis, Scalability of Semantic Analysis in Natural Language Processing. Its primary algorithms included novel implementations of latent Dirichlet allocation (LDA) and latent semantic analysis (LSA, another name for LSI), as well as TF-IDF and random projection implementations. It has since grown into one of the largest NLP and information retrieval libraries for Python, and it is both memory-efficient and scalable, unlike the largely academic code previously available for semantic modeling (for example, the Stanford Topic Modeling Toolbox).
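To make the bag-of-words and TF-IDF representations mentioned above more concrete, here is a minimal sketch in plain Python. It is not Gensim's implementation: the sample documents, the whitespace tokenizer, and the helper names `to_bow` and `to_tfidf` are all assumptions made for illustration, and the weighting used (count times log2(N / document frequency), followed by L2 normalization) is just one common formulation. The sparse list of `(token_id, weight)` pairs is the same general shape of vector that Gensim works with.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
documents = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "graph minors a survey",
]

# Naive whitespace tokenization (an assumption; real pipelines do more cleaning).
texts = [doc.lower().split() for doc in documents]

# Map every distinct token to an integer id; sorting keeps the ids deterministic.
all_tokens = sorted({token for text in texts for token in text})
vocabulary = {token: token_id for token_id, token in enumerate(all_tokens)}


def to_bow(tokens):
    """Return a sparse bag-of-words vector as sorted (token_id, count) pairs."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())


bow_corpus = [to_bow(text) for text in texts]

# Document frequency: in how many documents does each token id occur?
doc_freq = Counter(token_id for bow in bow_corpus for token_id, _ in bow)
num_docs = len(bow_corpus)


def to_tfidf(bow):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weighted = [
        (token_id, count * math.log2(num_docs / doc_freq[token_id]))
        for token_id, count in bow
    ]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    if norm == 0.0:
        return []  # every token occurs in every document, so nothing is distinctive
    return [(token_id, weight / norm) for token_id, weight in weighted if weight > 0.0]


tfidf_corpus = [to_tfidf(bow) for bow in bow_corpus]

print(bow_corpus[2])
print(tfidf_corpus[2])
```

In Gensim itself, `corpora.Dictionary` (with its `doc2bow` method) and `models.TfidfModel` cover these same steps, so the helpers above are only meant to show what the resulting vectors look like.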



from DZone.com Feed https://ift.tt/2LUf9CW
