I like working with pandas when dealing with tables, because of my affinity for the tidyverse in R. I have a table of about 200,000 rows and need to replace punctuation, extract the non-English words, and put them in another column named non_english in the same table. I prefer the enchant library because I found it more accurate than the nltk library. My dummy table df has a dundee column, which is the one I am working on. The dummy data is as follows:
df = pd.DataFrame({'dundee': ["I love:Marae", "My Whanau is everything", "I love Matauranga", "Tāmaki Makaurau is Whare", "AOD problem is common"]})
My idea is to remove punctuation first, write a function to extract non-English words, and then apply the function to the dataframe, but I got this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Here is my code:
import pandas as pd
import enchant
import re
import string
# remove punctuation (regex=True is needed explicitly on pandas 2.0 and later)
df['dundee1'] = df['dundee'].str.replace(r'[^\w\s]+', ' ', regex=True)
# change words to lower case
df['dundee1'] = df['dundee1'].str.lower()
# Function to check if a word is English
def check_eng(word):
    # use all available English dictionaries
    en_ls = ['en_NZ', 'en_US', 'en_AU', 'en_GB']
    en_bool = False
    # check all common dictionaries if word is English
    for en in en_ls:
        dic = enchant.Dict(en)
        if word != '':
            if dic.check(word) == True:
                en_bool = True
                break
    disp_non_en = ""
    word = word.str.split(' ')
    if len(word) != 0:
        if en_bool == False:
            disp_non_en = disp_non_en + word + ', '
    return disp_non_en

df['non_english'] = check_eng(df['dundee1'])
The desired table is this:
   dundee                     non_english
0  I love:Marae               Marae
1  My Whanau is everything    Whanau
2  I love Matauranga          love, Matauranga
3  Tāmaki Makaurau is Whare   Tāmaki Makaurau, Whare
4  AOD problem is common      AOD
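
For reference, the steps described above (strip punctuation, split into words, keep the words that fail the English check) can be sketched on one string at a time instead of on the whole Series. This is only a rough illustration: the names non_english_words, toy_is_english and TOY_ENGLISH are made up for the example, and the tiny word set is a stand-in so the snippet runs without enchant installed. With enchant, the check could presumably be replaced by something like any(d.check(word) for d in dicts), assuming the listed dictionaries are installed.

```python
import re

# Hypothetical stand-in vocabulary so the sketch runs without enchant.
# With enchant, the predicate could instead be:
#   dicts = [enchant.Dict(tag) for tag in ('en_NZ', 'en_US', 'en_AU', 'en_GB')]
#   is_english = lambda w: any(d.check(w) for d in dicts)
# (assumes those dictionaries are installed on the machine)
TOY_ENGLISH = {"i", "love", "my", "is", "everything", "problem", "common"}


def toy_is_english(word):
    """Stand-in predicate: case-insensitive lookup in the toy vocabulary."""
    return word.lower() in TOY_ENGLISH


def non_english_words(text, is_english=toy_is_english):
    """Return the words of one string that fail the check, comma-separated."""
    # Replace runs of punctuation with a space, then split on whitespace.
    words = re.sub(r"[^\w\s]+", " ", text).split()
    # Keep the original casing of every word that is not recognised.
    return ", ".join(w for w in words if not is_english(w))


rows = ["I love:Marae", "My Whanau is everything", "AOD problem is common"]
results = [non_english_words(r) for r in rows]
print(results)
```

On a dataframe, a function like this could be applied row by row with df['non_english'] = df['dundee'].apply(non_english_words), so that each call receives a single string and not the whole Series. Note that this sketch lists every unrecognised word separately, so "Tāmaki Makaurau" would come out as "Tāmaki, Makaurau" and not as one entry as in the desired table.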