What is NLP and what is BOW?

Natural language processing (NLP) is a popular field of artificial intelligence. Simply put, it takes text, such as news articles, tweets, comments, or any paragraph or sentence made of words, as input for further analysis. Just as you read and understand an article, NLP algorithms do something very similar.

However, the difference between humans and NLP algorithms is that the algorithms cannot take in words literally; they can only take them in as numbers. So when you hear about NLP-related topics, people are mostly talking about ways to transform text into numbers. This is the key component of an NLP algorithm.

From a macro perspective, there are two ways to transform text into numbers:

  • Word frequency
  • Underlying meaning of text

In this article, we will talk about the first one, word frequency.

Bag of Words (BOW) is a commonly used word-frequency-based NLP method. It represents text as a bag of its words, disregarding order and meaning but keeping multiplicity (how many times each word occurs). What does this mean?

For any NLP algorithm, we always transform the text into tokenized form first, meaning that each word is treated separately. BOW counts each of the tokens (words) and represents the sentence as a vector (a combination of word counts).

Example:

Sentence1: Will is a nice guy. Will likes nice movies.

Tokenized1: “Will” “is” “a” “nice” “guy” “Will” “likes” “nice” “movies”

BOW1: {“Will”:2, “is”:1, “a”:1, “nice”:2, “guy”:1, “likes”:1, “movies”:1}

Sentence2: William is a nice dude. William like good movies.

Tokenized2: “William” “is” “a” “nice” “dude” “William” “like” “good” “movies”

BOW2: {“William”:2, “is”:1, “a”:1, “nice”:1, “dude”:1, “like”:1, “good”:1, “movies”:1}
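The tokenizing and counting steps in this example can be sketched in a few lines of plain Python. This is only an illustration, not how any particular library does it: the helper names `tokenize` and `bag_of_words` are made up for this sketch, and the tokenizer assumes plain English text, keeping letter sequences as they are (case included) and dropping punctuation.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation (case is kept)."""
    return re.findall(r"[A-Za-z]+", sentence)


def bag_of_words(sentence):
    """Count how many times each token occurs in the sentence."""
    return Counter(tokenize(sentence))


sentence1 = "Will is a nice guy. Will likes nice movies."
sentence2 = "William is a nice dude. William like good movies."

bow1 = bag_of_words(sentence1)
bow2 = bag_of_words(sentence2)

print(tokenize(sentence1))
print(dict(bow1))
print(dict(bow2))
```

Note that “like” and “likes” are counted as two different words, because BOW only looks at the exact tokens.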

When we want to analyze these sentences together, BOW creates a dictionary that contains all the words used, and then maps each sentence to a corresponding vector. What does that mean?

Dictionary: {“Will”, “is”, “a”, “nice”, “guy”, “likes”, “movies”, “William”, “dude”, “like”, “good”}

And what does the final output look like?
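One way to see it is to compute it directly. The sketch below is plain Python rather than a library implementation, and the helper names `build_vocabulary` and `vectorize` are made up for illustration. Each sentence becomes a vector with one count per dictionary word, in dictionary order, and a 0 wherever the sentence does not use that word.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation (case is kept)."""
    return re.findall(r"[A-Za-z]+", sentence)


def build_vocabulary(tokenized_sentences):
    """Collect each distinct token once, in order of first appearance."""
    vocabulary = []
    seen = set()
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


def vectorize(tokens, vocabulary):
    """Return one count per vocabulary word (0 if the word is absent)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


sentences = [
    "Will is a nice guy. Will likes nice movies.",
    "William is a nice dude. William like good movies.",
]

tokenized = [tokenize(s) for s in sentences]
vocabulary = build_vocabulary(tokenized)
vectors = [vectorize(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
# Vector1: [2, 1, 1, 2, 1, 1, 1, 0, 0, 0, 0]
# Vector2: [0, 1, 1, 1, 0, 0, 1, 2, 1, 1, 1]
```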

This way, your text is transformed from text strings into vectors. The output can be passed on to analytics tools or machine learning models for further analysis.

Assuming we have the text, here is the code to implement the BOW method using the scikit-learn (sklearn) library in Python.

Code:

from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer object, which is scikit-learn's bag-of-words tool.
vectorizer = CountVectorizer(
    analyzer="word",
    tokenizer=None,
    preprocessor=None,
    stop_words=None,
    max_features=5000,
)

# Note that CountVectorizer comes with its own options to automatically do
# preprocessing, tokenization, and stop word removal. For each of these,
# instead of specifying None, we could have used a built-in method or
# specified our own function to use.

# fit_transform() does two things: first, it fits the model and learns the
# vocabulary; second, it transforms our training data into feature vectors.
# The input to fit_transform() should be a list of strings, so
# clean_train_reviews is assumed to be a list of already-cleaned texts.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# NumPy arrays are easy to work with, so convert the sparse result to a
# dense array.
train_data_features = train_data_features.toarray()

We can look at the words in the vocabulary using the following code.

# get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() instead.
vocab = vectorizer.get_feature_names_out()
print(vocab)

Lastly, we can also check the count of each word in the vocabulary.

import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

To conclude, in this article:

  • We talked about what NLP is and what the BOW method is.
  • We offered a straightforward example and Python code for implementation.

We will talk about the pros and cons of BOW, and how to improve it, in the near future.

Written by

Software consulting company that focuses on emerging technology such as AI, Blockchain, Cloud Computing, and Data Engineering, MERN Stack, and Fintech
