How to remove Non English words in Python?

Question

I am doing a sentiment analysis project in Python (using natural language processing). I have already collected the data from Twitter and saved it as a CSV file. The file contains tweets, which are mostly about cryptocurrency. I have cleaned the data, but there is one more step before I apply sentiment analysis using classification algorithms. Here are my imports and the code that loads the data:

# importing Libraries
from pandas import DataFrame, read_csv
import chardet
import matplotlib.pyplot as plt; plt.rcdefaults()
from matplotlib import rc
%matplotlib inline
import pandas as pd
plt.style.use('ggplot')
import numpy as np
import re
import warnings

#Visualisation
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from IPython.display import display
from mpl_toolkits.basemap import Basemap
from wordcloud import WordCloud, STOPWORDS

#nltk
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk import tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer


matplotlib.style.use('ggplot')
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore")

%matplotlib inline

# Read the CSV file into a DataFrame named ltweet
ltweet=pd.read_csv("C:\\Users\\name\\Documents\\python assignment\\litecoin1.csv",index_col = None, skipinitialspace = True)
print(ltweet)

I have already cleaned most of the data, so I have left that code out. My Tweets column contains tweets with text that is not in English. I want to remove all of that text (the non-English text only). Here is some example output:

ltweet['Tweets'][0:4]

output:
0      the has published a book on understanding العَرَبِيَّة‎
1      accepts litecoin gives % discount on all iphon...
2      days until litepay launches accept store and s...
3           ltc to usd price litecoin ltc cryptocurrency

Is there a way to remove non English words in the data? Can anyone help me write the code for it? By the way, the code is based on Pandas.

You can remove everything not using the Latin alphabet, but for the rest, are you prepared to remove all English misspellings too? — Arndt Jonasson, Mar 27 '18 at 10:51
Same question answered here [removing-the-non-english-data](https://stackoverflow.com/questions/62602646/removing-the-non-english-data) — FEldin, Aug 14 '20 at 12:53
Do you think "Litecoin" and "USD" are "English"? What about "LTE" and "%"? — tripleee, Feb 26 '22 at 10:15
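
The first comment's suggestion, dropping everything that is not written in the Latin alphabet, can be sketched roughly as below. This is a minimal illustration and not a full language filter: the function name `drop_non_ascii_words` is made up for this example, and it treats "Latin alphabet" as plain ASCII, which is a simplifying assumption.

```python
def drop_non_ascii_words(text: str) -> str:
    """Keep only whitespace-separated tokens made entirely of ASCII characters.

    Assumption: "Latin alphabet" is approximated by ASCII, so accented words
    such as "café" are dropped as well.
    """
    return " ".join(word for word in text.split() if word.isascii())


# The Arabic word is written with escapes so the example is easy to copy.
sample = "the has published a book on understanding \u0627\u0644\u0639\u0631\u0628\u064a\u0629"
print(drop_non_ascii_words(sample))
```

Assuming every entry in the column is a string, this could be applied with `ltweet['Tweets'].apply(drop_non_ascii_words)`. Note that it keeps non-English words that happen to be written in ASCII (for example "bonjour"), which is the limitation the comments above point at.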

lenngro · Answer 1 · 2018-03-27T11:08:41.633


There has been a similar question here.

You could try enchant:

import enchant
d = enchant.Dict("en_US")
word = "Bonjour"
d.check(word)

This returns False, because "Bonjour" is not in the en_US dictionary.

Do this for every word in the text:

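english_words = []
# text is assumed to be a string; split it into words first,
# because iterating over a string directly yields single characters
for word in text.split():
    if d.check(word):
        english_words.append(word)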

Edit: Watch out for words that appear in multiple languages.
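
To apply the same idea to whole tweets, one possible sketch is below. The names `keep_english_words`, `extra_vocab` and `demo_check` are hypothetical and exist only for this example. The dictionary check is passed in as a function, so the block runs without enchant installed; a tiny stand-in vocabulary plays the role of `d.check` here. The `extra_vocab` argument is an assumption about how you might want to keep domain terms such as "ltc" and "usd", which an English dictionary will probably reject.

```python
from typing import Callable, Iterable


def keep_english_words(
    text: str,
    is_english: Callable[[str], bool],
    extra_vocab: Iterable[str] = (),
) -> str:
    """Return text with only the words accepted by is_english or listed in extra_vocab."""
    allowed = {word.lower() for word in extra_vocab}
    kept = [
        word
        for word in text.split()
        if word.lower() in allowed or is_english(word)
    ]
    return " ".join(kept)


# Stand-in for a real dictionary check, used only for demonstration.
demo_vocab = {"to", "price", "the", "book"}


def demo_check(word: str) -> bool:
    return word.lower() in demo_vocab


print(keep_english_words("ltc to usd price bonjour", demo_check, extra_vocab=["ltc", "usd"]))
```

With enchant available, `d.check` from the snippet above could be passed as `is_english`, and the function applied per row with something like `ltweet['Tweets'].apply(lambda t: keep_english_words(t, d.check))`, assuming every entry in the column is a string.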

edited Mar 27 '18 at 11:08

answered Mar 27 '18 at 11:00


I am getting an error while installing it: ERROR: Could not find a version that satisfies the requirement enchant (from versions: none) ERROR: No matching distribution found for enchant – Soumyaansh Sep 12 '19 at 05:29
You can still install the package using `pip install pyenchant`, although the package is no longer maintained: https://github.com/rfk/pyenchant – lenngro Sep 12 '19 at 07:30
EDIT: OK, I just updated pip and now I seem to get the same error. This may be due to the project not being maintained anymore :( – lenngro Sep 12 '19 at 07:36
