Real or Not? NLP with Disaster Tweets

2020-03-03

Overview

The goal of this challenge is to get started with NLP concepts and apply them to predict whether the text of a social media post such as a tweet describes a real disaster or not.

0. Libraries used

import pandas as pd
import matplotlib.pyplot as plt
import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn import model_selection
import math
[nltk_data] Downloading package stopwords to /home/harsha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

1. Basic EDA

1.1 Loading the dataset

train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

1.2 Fields

Along with an id column, the train data has the following fields

  1. text - The tweet itself
  2. keyword - A keyword from the tweet
  3. location - Location where the tweet was posted from
  4. target - A label indicating whether the tweet is about a real disaster (1) or not (0)

The test data contains all the fields of the train set except the target field.

print(train_data.columns)
Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

1.3 Shape of data

The shapes of the train and test datasets are shown below

print("Shape(train_data): ", train_data.shape)
print("Shape(test_data): ",test_data.shape)
Shape(train_data):  (7613, 5)
Shape(test_data):  (3263, 4)

1.4 Missing values

train_data.isnull().sum()
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

1.5 Distribution of labels

The dataset is reasonably balanced in terms of the distribution of labels

train_data['target'].value_counts()
0    4342
1    3271
Name: target, dtype: int64
counts = train_data['target'].value_counts()
plt.bar(counts.index, counts)
plt.ylabel("Counts")
plt.show()

[Bar chart: counts of the two target labels]

1.6 Some disaster tweets

print(train_data[train_data['target']==1]['text'][:10])
0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
5    #RockyFire Update => California Hwy. 20 closed...
6    #flood #disaster Heavy rain causes flash flood...
7    I'm on top of the hill and I can see a fire in...
8    There's an emergency evacuation happening now ...
9    I'm afraid that the tornado is coming to our a...
Name: text, dtype: object

1.7 Top 20 Keywords

keyword_counts = train_data['keyword'].value_counts()[:20]
plt.barh(y=keyword_counts.index, width=keyword_counts)
plt.show()

[Horizontal bar chart: the 20 most frequent keywords]

2. Text Pre-processing

Before building or training any model, it is important to perform some cleanup operations to make the data consistent and to eliminate the parts of it that contribute little signal. Below are some basic cleanup steps.

2.1 Basic cleanup

This step includes

  1. converting the text to lowercase
  2. removing links and HTML tags
  3. removing text in square brackets, punctuation, newlines, and words that contain numbers

def sanitize(text):
    text = text.lower()                                              # lowercase for consistency
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # links
    text = re.sub(r'<.*?>+', '', text)                               # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                                   # newlines
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing numbers
    return text

train_data['text'] = train_data['text'].apply(sanitize)
# clean the test set the same way so train and test are processed consistently
test_data['text'] = test_data['text'].apply(sanitize)
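For example, running sanitize on a made-up tweet (hypothetical text, not taken from the dataset) shows the effect:

print(sanitize("Forest FIRE near camp [photos]: http://t.co/abc123 evacuate NOW!!"))
# prints roughly: forest fire near camp evacuate now
# (extra spaces may remain where tokens were removed; they are harmless for the vectorizers below)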

2.2 Removing stopwords

Another useful way to remove noise from the data is to eliminate stopwords. These are the most commonly occurring words, such as the, a, an, and are, which contribute little information for most NLP tasks.

The library nltk offers an extensive collection of such stopwords. The method below removes any stopwords from the text.
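To get a feel for what counts as a stopword, we can peek at nltk's English list (the exact contents and length depend on the installed nltk version):

english_stopwords = stopwords.words('english')
print(len(english_stopwords))    # list length varies by nltk version
print(english_stopwords[:10])    # e.g. ['i', 'me', 'my', 'myself', 'we', ...]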

stop_words = set(stopwords.words('english'))  # build the set once rather than on every lookup

def remove_stopwords(text):
    # keep only the words that are not in the stopword set
    return ' '.join(w for w in text.split() if w not in stop_words)

train_data['text'] = train_data['text'].apply(remove_stopwords)
test_data['text'] = test_data['text'].apply(remove_stopwords)

train_data['text'].head()
0         deeds reason earthquake may allah forgive us
1                forest fire near la ronge sask canada
2    residents asked shelter place notified officer...
3     people receive wildfires evacuation orders ca...
4    got sent photo ruby alaska smoke wildfires pou...
Name: text, dtype: object

3. Text Processing

3.1 Bag of Words

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.

It is called a “bag” of words because any information about the order or structure of the words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur.
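As a toy illustration (made-up sentences and plain Python, just to show the idea), a bag of words is simply a vocabulary plus per-document word counts:

from collections import Counter

docs = ["forest fire near la ronge", "heavy rain causes flash flood near the river"]

# vocabulary: every distinct word seen across the corpus
vocab = sorted(set(word for doc in docs for word in doc.split()))

# each document becomes a vector of counts over that vocabulary
for doc in docs:
    counts = Counter(doc.split())
    print([counts[word] for word in vocab])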

scikit-learn's CountVectorizer() offers an abstraction that converts our text column into a bag-of-words matrix, where every column represents a different word.

count_vectorizer = CountVectorizer()
train_vectors = count_vectorizer.fit_transform(train_data['text'])
test_vectors = count_vectorizer.transform(test_data["text"])
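It is worth sanity-checking the resulting matrix; the vocabulary size depends on the cleanup steps applied earlier:

print(train_vectors.shape)                 # (number of tweets, vocabulary size)
print(len(count_vectorizer.vocabulary_))   # number of distinct words learned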

3.2 Transforming text to vectors using TF-IDF

The tf-idf weight is typically composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization:

TF = (Number of times term t appears in a document)/(Number of terms in the document)

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF = 1 + log(N/n), where N is the total number of documents and n is the number of documents in which the term t appears.
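As a quick sanity check, here is a small sketch that computes TF and IDF exactly as defined above for a made-up two-document corpus (scikit-learn's TfidfVectorizer uses a slightly different smoothed IDF and normalization, so its numbers will not match this exactly):

docs = ["forest fire near la ronge", "flash flood after heavy rain near the river"]
term = "near"

N = len(docs)                                  # total number of documents
n = sum(1 for d in docs if term in d.split())  # documents containing the term

tf = docs[0].split().count(term) / len(docs[0].split())  # TF of 'near' in the first document
idf = 1 + math.log(N / n)                                # IDF as defined above

print(tf, idf, tf * idf)  # 0.2 1.0 0.2 -- 'near' appears in every document, so log(N/n) = 0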

# keep unigrams and bigrams, ignoring terms that appear in fewer than 2 tweets or in more than half of them
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
train_tfidf = tfidf.fit_transform(train_data['text'])

4. Building a model to classify the text

At this stage, the tweets have been transformed into vectors, both as raw counts (bag of words) and as TF-IDF weights, and are ready to be used by a classification model. Here we train a Multinomial Naive Bayes classifier on the count vectors; the TF-IDF vectors could be plugged in the same way.

model = MultinomialNB()

# estimate performance with 5-fold cross-validation (F1), then fit on the full training set
score = model_selection.cross_val_score(model, train_vectors, train_data["target"], cv=5, scoring="f1")
model.fit(train_vectors, train_data['target'])
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
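score holds the per-fold F1 scores; printing their mean gives a rough idea of how the classifier should perform on unseen tweets (the exact value depends on the preprocessing choices above):

print("Mean CV F1: %.3f" % score.mean())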

5. Submitting Predictions

# the sample submission already contains the test ids in the expected order
submission = pd.read_csv('data/sample_submission.csv')
submission['target'] = model.predict(test_vectors)
submission.to_csv('submission.csv', index=False)
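A quick check before uploading: the file should have one row per test tweet and only the id and target columns.

print(submission.shape)   # should be (3263, 2) -- one prediction per test tweet
print(submission.head())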