
I am doing a sentiment analysis project in Python (using Natural Language Processing). I already collected the data from Twitter and saved it as a CSV file. The file contains tweets, which are mostly about cryptocurrency. I have cleaned the data, but there is one more step before I apply sentiment analysis using classification algorithms. Here is the code that imports the libraries and reads the file:

# Importing libraries
from pandas import DataFrame, read_csv
import chardet
import matplotlib.pyplot as plt
plt.rcdefaults()
from matplotlib import rc
%matplotlib inline
import pandas as pd
plt.style.use('ggplot')
import numpy as np
import re
import warnings

# Visualisation
import matplotlib
import seaborn as sns
from IPython.display import display
from mpl_toolkits.basemap import Basemap
from wordcloud import WordCloud, STOPWORDS

# NLTK and scikit-learn
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore")

# Read the CSV file into a DataFrame named ltweet
ltweet = pd.read_csv("C:\\Users\\name\\Documents\\python assignment\\litecoin1.csv", index_col=None, skipinitialspace=True)
print(ltweet)

I have already cleaned most of the data, so I have left that code out. My column includes tweets that contain mostly non-English text. I want to remove all of it (the non-English text only). Here is a sample of the output:

ltweet['Tweets'][0:4]

output:
0      the has published a book on understanding العَرَبِيَّة‎
1      accepts litecoin gives % discount on all iphon...
2      days until litepay launches accept store and s...
3           ltc to usd price litecoin ltc cryptocurrency

Is there a way to remove non-English words from the data? Can anyone help me write the code for it? By the way, the code is based on pandas.

Aziz Bokhari

1 Answer


There has been a similar question here.

You could try enchant:

import enchant
d = enchant.Dict("en_US")
word = "Bonjour"
d.check(word)

This returns False, because "Bonjour" is not in the en_US dictionary.

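Do this for every word in the text (here text is assumed to be a single string, so it is split on whitespace first):

english_words = []
# Iterating over a string directly would yield characters, not words
for word in text.split():
    if d.check(word):
        english_words.append(word)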
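
To apply the same idea to a whole column, the loop can be wrapped in a small function. The following is only a sketch: `keep_english_words` and `demo_check` are hypothetical names, and `demo_check` is a tiny stand-in vocabulary so the example runs without enchant installed. In practice you would pass `d.check` from the snippet above instead.

```python
def keep_english_words(text, is_english):
    """Return text with only the tokens that is_english() accepts."""
    # Missing values (e.g. NaN in a pandas column) are not strings
    if not isinstance(text, str):
        return ""
    # split() breaks on whitespace and never yields empty tokens
    return " ".join(w for w in text.split() if is_english(w))


# Hypothetical stand-in for enchant.Dict("en_US").check
_DEMO_VOCAB = {"the", "has", "published", "a", "book", "on", "understanding"}


def demo_check(word):
    return word.lower() in _DEMO_VOCAB


cleaned = keep_english_words(
    "the has published a book on understanding العَرَبِيَّة", demo_check
)
print(cleaned)  # the has published a book on understanding
```

With pandas and a working enchant dictionary, this could be applied as `ltweet['Tweets'] = ltweet['Tweets'].apply(lambda t: keep_english_words(t, d.check))`. Note that a dictionary check will also drop tokens such as "ltc" or "iphone" if they are not in the dictionary.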

Edit: Watch out for words that appear in multiple languages.
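
If a dictionary is not available, a cruder alternative (a different technique from the enchant lookup, shown only as a sketch) is to drop every token that contains non-ASCII characters. This removes words written in other scripts, such as the Arabic word in the sample output, but it keeps Latin-script words from other languages like "Bonjour" and it drops accented words like "café". The function name `drop_non_ascii_words` is hypothetical, and `str.isascii()` requires Python 3.7 or later.

```python
def drop_non_ascii_words(text):
    """Keep only whitespace-separated tokens made entirely of ASCII characters."""
    return " ".join(w for w in text.split() if w.isascii())


sample = "the has published a book on understanding العَرَبِيَّة"
print(drop_non_ascii_words(sample))  # the has published a book on understanding
```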

lenngro
  • I am getting an error while importing it: ERROR: Could not find a version that satisfies the requirement enchant (from versions: none) ERROR: No matching distribution found for enchant – Soumyaansh Sep 12 '19 at 05:29
  • You can still install the package using `pip install pyenchant`, although the package is no longer maintained: https://github.com/rfk/pyenchant – lenngro Sep 12 '19 at 07:30
  • EDIT: OK, I just updated pip and now I seem to get the same error. This may be due to the project not being maintained anymore :( – lenngro Sep 12 '19 at 07:36