
I like working with pandas because of my affinity for the tidyverse in R when dealing with tables. I have a table of about 200,000 rows and need to replace punctuation, extract the non-English words, and put them in another column named non_english in the same table. I prefer the enchant library because I found it more accurate than nltk. My dummy table df has a dundee column, which is the one I am working on. The dummy data is as follows:

df = pd.DataFrame({'dundee': ["I love:Marae", "My Whanau is everything", "I love Matauranga", "Tāmaki Makaurau is Whare", "AOD problem is common"]})

My idea is to remove punctuation first, write a function that extracts non-English words, and then apply the function to the dataframe, but I got this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Here is my code:

import pandas as pd
import enchant
import re
import string

# remove punctuations
df['dundee1'] = df['dundee'].str.replace(r'[^\w\s]+', ' ')

# change words to lower case
df['dundee1'] = df['dundee1'].str.lower()


# Function to check if a word is english
def check_eng(word):
    
    # use all available english dictionary
    en_ls = ['en_NZ', 'en_US', 'en_AU', 'en_GB']
    en_bool = False
            
    # check all common dictionaries if word is English 
    for en in en_ls:
        dic = enchant.Dict(en)
        if word != '':
            if dic.check(word) == True:
                en_bool = True
                break

    disp_non_en = ""
    word = word.str.split(' ')

    if len(word) != 0:
        if en_bool == False:
             disp_non_en = disp_non_en + word + ', '

    return disp_non_en

df['non_english'] = check_eng(df['dundee1'])

The desired table is this:

    dundee                          non_english
0   I love:Marae                    Marae
1   My Whanau is everything         Whanau
2   I lov Matauranga                love, Matauranga
3   Tāmaki Makaurau is Whare        Tāmaki Makaurau, Whare
4   AOD problem is common           AOD
Wiktor Stribiżew

2 Answers


The error comes from the call:

check_eng(df['dundee1'])

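where df['dundee1'] is a Series, and the function contains an if statement that needs a single Boolean value from:

 if word != '':

Here word is the whole Series, so the comparison is element-wise and its truth value is ambiguous. Call the function once per cell instead: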

df['dundee1'].apply(check_eng)

instead.
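
To illustrate the difference, here is a minimal sketch. The helper first_word and the two-row frame are made up for the demonstration and are not part of the question's code; the point is only that the same function fails on a whole Series but works when apply hands it one str at a time.

```python
import pandas as pd

# Tiny stand-in for the question's table (hypothetical data)
df = pd.DataFrame({"dundee1": ["i love marae", "my whanau is everything"]})


def first_word(text):
    # Inside apply, `text` is a plain Python str, so this comparison is a single bool.
    if text != "":
        return text.split()[0]
    return ""


# Passing the whole Series: `text != ""` is evaluated element-wise and yields
# a boolean Series, which `if` cannot reduce to one True/False.
try:
    first_word(df["dundee1"])
    raised = False
except ValueError:
    raised = True

# apply calls first_word once per cell
per_row = df["dundee1"].apply(first_word)

print(raised)
print(per_row.tolist())
```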

There is one more issue in check_eng.

Instead of:

 if len(word) != 0:
        if en_bool == False:
             disp_non_en = disp_non_en + word + ', '

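You should rather use the following (inside apply, word is a plain str, so it is word.split, not word.str.split):

words = word.split(' ')
for w in words:
    # en_bool has to be computed for each w, so the dictionary loop belongs in here
    if w != '' and en_bool == False:
        disp_non_en = disp_non_en + w + ', '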

because you do:

word = word.str.split(' ')

which rebinds word to the result of the split (a list of tokens rather than a single str), so the if and the string concatenation that follow no longer operate on one word.

You may also want to read more about the error itself: Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
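
Putting both points together, one possible shape for the function is sketched below. This is not the asker's exact code: the names extract_non_english, make_enchant_checker and demo_checker are hypothetical, and the English check is passed in as a predicate so the sketch can run without pyenchant installed. The enchant-backed predicate assumes pyenchant and at least one of the listed dictionaries are available.

```python
def make_enchant_checker(tags=("en_NZ", "en_US", "en_AU", "en_GB")):
    """Return a predicate word -> bool backed by pyenchant (assumed installed)."""
    import enchant  # imported lazily so the rest of the sketch runs without it

    # Build each dictionary once, skipping tags that have no installed dictionary.
    dicts = [enchant.Dict(tag) for tag in tags if enchant.dict_exists(tag)]
    return lambda word: any(d.check(word) for d in dicts)


def extract_non_english(text, is_english):
    """Return the words of `text` rejected by `is_english`, comma-separated."""
    # split() with no argument also drops empty tokens left by repeated spaces.
    return ", ".join(w for w in text.split() if not is_english(w))


# Stand-in predicate for the demonstration only (a tiny hand-made word list).
KNOWN = {"i", "love", "my", "is", "everything"}


def demo_checker(word):
    return word.lower() in KNOWN


result = extract_non_english("i love marae", demo_checker)
print(result)
```

With pyenchant available, this could be wired in as df['non_english'] = df['dundee1'].apply(extract_non_english, is_english=make_enchant_checker()). Note that this sketch separates every rejected word with a comma, so adjacent non-English words are not grouped into one entry the way the desired table shows for row 3.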

sophros
  • Thanks @sophos. I have changed `check_eng(df['dundee1'])` to `df['dundee1'].apply(check_eng)`, but now there is an error: `AttributeError: 'str' object has no attribute 'str'` – Bible Stands Aug 25 '21 at 07:55
  • @WiktorStribiżew: sorry, you are wrong here. The issue is in the type of `word` that changes in the function. Pointed it out in the answer now. – sophros Aug 25 '21 at 08:55

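Remove .str from word.str.split(' ') and it will work. Inside apply, word is a plain Python str, which has no .str accessor (that exists only on a Series). Try this: words = word.split(' ')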
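
One detail worth checking when splitting: split(' ') keeps empty strings wherever two spaces sit next to each other, which can happen after punctuation has been replaced by a space. The short sketch below (the sample string is made up) compares it with split() without an argument. As far as I know, pyenchant's check raises a ValueError for an empty string, which is presumably why the question guards with word != ''; treat that as an assumption and verify it against your installed version.

```python
# Two consecutive spaces, e.g. what "i love: marae" becomes once ':' is replaced by ' '
spaced = "i love  marae"

by_space = spaced.split(" ")    # splits on every single space, keeps the empty token
by_whitespace = spaced.split()  # splits on runs of whitespace, no empty tokens

print(by_space)
print(by_whitespace)
```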