
I'm currently writing code to extract frequently used words from my CSV file, and it works just fine until I get a bar plot full of strange words. I don't know why; it's probably because there are some foreign words involved. However, I don't know how to fix this.

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import matplotlib
from matplotlib import pyplot as plt
import sys
sys.setrecursionlimit(100000)
# import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\nlp_dataset\\commitment.csv", encoding='cp1252',na_values=" NaN")

data.shape
data['text'] = data['text'].fillna('none')
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    #replacing the punctuations with no space,
    #which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    #return the text stripped of punctuation marks
    return text.translate(translator)

# Apply the function to each example
data['text'] = data['text'].apply(remove_punctuation)
data.head(10)

#Removing stopwords -- extract the stopwords
#extracting the stopwords from nltk library
sw = stopwords.words('english')
#displaying the stopwords
np.array(sw)

# function to remove stopwords
def remove_stopwords(text):
    '''a function for removing stopwords'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with a space separator
    return " ".join(text)

# Apply the function to each examples
data['text'] = data['text'].apply(remove_stopwords)
data.head(10)

# Top words before stemming  
# create a count vectorizer object
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(data['text'])
# collect the vocabulary items used in the vectorizer
dictionary = count_vectorizer.vocabulary_.items() 

#store the vocab and counts in a pandas dataframe
vocab = []
count = []
# iterate through each vocab item and count, appending the values to the designated lists
for key, value in dictionary:
    vocab.append(key)
    count.append(value)
# store the counts in a pandas Series with vocab as index
vocab_bef_stem = pd.Series(count, index=vocab)
#sort the dataframe
vocab_bef_stem = vocab_bef_stem.sort_values(ascending=False)

# Bar plot of top words before stemming
top_vocab = vocab_bef_stem.head(20)
top_vocab.plot(kind='barh', figsize=(5, 10), xlim=(1000, 5000))

I want a list of frequent words ordered in a bar plot, but for now it just gives non-English words that all have the same frequency. Please help me out.


1 Answer


The problem is that you are sorting your vocabulary not by counts but by the unique IDs that the count vectorizer creates.

count_vectorizer.vocabulary_.items() 

This does not contain the count of each feature. vocabulary_ stores only the mapping from each feature to its unique ID (its column index in the document-term matrix); the counts are not saved anywhere in the vectorizer.
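You can see this on a toy example (a quick sketch; the exact print order may differ):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(['apple banana apple', 'banana cherry'])
# each term is mapped to a column index, not a frequency:
# 'apple' occurs twice but still gets the ID 0
print(cv.vocabulary_)  # {'apple': 0, 'banana': 1, 'cherry': 2}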

Hence, you are seeing the rarest/misspelled words in the plot (since these happen to receive large ID values) rather than the frequent ones. The way to get the counts of the words is to apply transform on your text data and sum the counts of each word over all documents.

By default, CountVectorizer removes the punctuation, and you can also feed it a list of stop words to remove. Your code can be reduced as follows.

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document ?',
]

sw = stopwords.words('english')

count_vectorizer = CountVectorizer(stop_words=sw)
X = count_vectorizer.fit_transform(corpus)
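# note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions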
vocab = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
vocab.sort_values(ascending=False).plot.bar(figsize=(5, 5), xlim=(0, 7))

Instead of corpus, plug in your text data column. The output of the above snippet will be:

[bar plot of the word counts, sorted in descending order]
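To adapt this to your own data, the same pattern would look something like this (a sketch, assuming your cleaned text column is still data['text'] and you want the top 20 words):

X = count_vectorizer.fit_transform(data['text'])
vocab = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
vocab.sort_values(ascending=False).head(20).plot(kind='barh', figsize=(5, 10))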
