Natural language processing (NLP) is a popular field within artificial intelligence. Simply put, it takes text as input for further analysis: news articles, tweets, comments, or any other sentences and paragraphs. Just as you read and understand an article, NLP algorithms do something very similar.
However, the difference between humans and NLP algorithms is that the algorithms cannot take in words literally; they can only take them in as numbers. So when you hear about NLP-related topics, people are mostly talking about ways to transform text into numbers. This transformation is the key component of an NLP algorithm.
From a macro perspective, there are two ways to transform text into numbers:
- Word frequency
- Underlying meaning of text
In this article, we will talk about the first one, word frequency.
Bag of Words (BOW) is a commonly used word-frequency-based NLP technique. It represents a text as a bag of its words, disregarding order and meaning but keeping multiplicity. What does this mean?
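As a rough illustration of "disregarding order but keeping multiplicity", here is a minimal sketch in Python. The helper name `bag_of_words` and the lowercase, whitespace-based splitting are assumptions made for this example only, not a standard API; real pipelines usually tokenize more carefully.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping for the given text (hypothetical helper)."""
    # Assumption: naive tokenization by lowercasing and splitting on whitespace.
    tokens = text.lower().split()
    # Counter keeps how many times each word occurs but not where it occurs.
    return Counter(tokens)


first = bag_of_words("the cat chased the dog")
second = bag_of_words("the dog chased the cat")

# The two sentences mean different things, yet their bags are identical,
# because word order (and therefore meaning) is thrown away.
print(first == second)

# Multiplicity is kept: "the" occurs twice in the first sentence.
print(first["the"])
```

The two sentences produce the same bag even though they describe opposite events, while the count for "the" still records that the word occurs twice.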
For any NLP algorithm, we first transform the text into tokenized form, meaning we treat each word separately. BOW counts each of the tokens (words) and represents the sentence as a vector (a combination…