I am doing a sentiment analysis project in Python (using Natural Language Processing). I already collected the data from twitter and saved it as a CSV file. The file contains tweets, which are mostly about cryptocurrency. I cleaned the data but there is one more thing before I apply sentiment analysis using classfication algorithms. Here's the out for importing libraries
# importing Libraries
from pandas import DataFrame, read_csv
import chardet
import matplotlib.pyplot as plt; plt.rcdefaults()
from matplotlib import rc
%matplotlib inline
import pandas as pd
plt.style.use('ggplot')
import numpy as np
import re
import warnings
#Visualisation
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from IPython.display import display
from mpl_toolkits.basemap import Basemap
from wordcloud import WordCloud, STOPWORDS
#nltk
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
matplotlib.style.use('ggplot')
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore")
%matplotlib inline
## Reading CSV File and naming the object called crime
ltweet=pd.read_csv("C:\\Users\\name\\Documents\\python assignment\\litecoin1.csv",index_col = None, skipinitialspace = True)
print(ltweet)
I already clean most of the data, so no need to put the codes for that part. In my column there are tweets that contains mostly non English language. I want to remove all of them(Non English text only). Here's the output for example
ltweet['Tweets'][0:3]
output:
0 the has published a book on understanding العَرَبِيَّة
1 accepts litecoin gives % discount on all iphon...
2 days until litepay launches accept store and s...
3 ltc to usd price litecoin ltc cryptocurrency
Is there a way to remove non English words in the data? Can anyone help me write the code for it? By the way, the code is based on Pandas.