1

I have an Excel file with tens of thousands of English/Latin and Arabic words in two columns. The first column is named "EN" and the other is named "AR". The column I want to work on is "AR".

I want to add a new column that contains 'ar' for each row whose "AR" value has only Arabic words, 'en' for each row that has only Latin words, and 'enar' for each row that has both Latin and Arabic words.

Note: numbers, periods '.', and commas ',' appear in all rows.

An example of my file, the work I want to do:

    EN                       AR                new column
    Appel                        تفاحة               ar
    Appel (1990)             (1990) تفاحة            ar
    R. Appel                 ر. تفاحة                ar
    Red, Appel               Red Appel                en
    Red Appel                Red Appel                en
    R. Appel                 R. Appel                 en
    Red, Appel               تفاحة، Red              enar
    Red Appel                Red تفاحة               enar

How can I do that using Python/Pandas?

Thank you guys for your help.

Charax
  • I thought I could finally put my Arabic degree to use and help out with this one, but I can't figure out the regex to match both EN and AR to get en|ar. Just out of interest, what sort of analysis are you working on? Also, it seems ر is used to represent Red in your dataset. – Umar.H Nov 09 '19 at 16:26
  • https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language – Michael Gardner Nov 09 '19 at 16:29

2 Answers

1

Here is a possible solution using the third-party regex library (installable with pip install regex).

Code

import pandas as pd
import regex

data = {'AR':['    تفاحة ','(1990) تفاحة', 'ر. تفاحة', 'Red Appel', 'Red Appel', 'R. Appel', 'تفاحة، Red', 'Red تفاحة']}

df = pd.DataFrame(data)

# \p{Arabic} and \p{Latin} match characters by Unicode script,
# so ASCII digits and punctuation never count as Arabic or Latin
df['is_arabic'] = df['AR'].apply(lambda t: bool(regex.search(r'\p{Arabic}', t)))

df['is_latin'] = df['AR'].apply(lambda t: bool(regex.search(r'\p{Latin}', t)))

# assign 'enar', 'ar', or 'en' from the two boolean flags of a row
def myfunc(t):
    if t['is_arabic'] and t['is_latin']:
        return 'enar'
    elif t['is_arabic']:
        return 'ar'
    else:
        return 'en'

df['new_column'] = df[['is_arabic','is_latin']].apply(myfunc, axis=1)

Output

#print(df)
#              AR  is_arabic  is_latin new_column
# 0        تفاحة        True     False         ar
# 1  (1990) تفاحة       True     False         ar
# 2      ر. تفاحة       True     False         ar
# 3     Red Appel      False      True         en
# 4     Red Appel      False      True         en
# 5      R. Appel      False      True         en
# 6    تفاحة، Red       True      True       enar
# 7     Red تفاحة       True      True       enar
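If installing a third-party module is not an option, the same classification can be sketched with the standard library alone. This is only a heuristic, not part of the solution above: it assumes that the Unicode name of every Arabic letter starts with "ARABIC" and that of every Latin letter starts with "LATIN", and the helper name classify_script and the empty-string fallback are hypothetical choices.

```python
import unicodedata


def classify_script(text):
    """Return 'ar', 'en', 'enar', or '' depending on which letters appear."""
    has_arabic = False
    has_latin = False
    for ch in text:
        # Skip digits, punctuation, and whitespace; only letters decide the label.
        if not ch.isalpha():
            continue
        # Heuristic: names look like 'ARABIC LETTER TEH' or 'LATIN CAPITAL LETTER R'.
        name = unicodedata.name(ch, '')
        if name.startswith('ARABIC'):
            has_arabic = True
        elif name.startswith('LATIN'):
            has_latin = True
    if has_arabic and has_latin:
        return 'enar'
    if has_arabic:
        return 'ar'
    if has_latin:
        return 'en'
    return ''  # assumption: rows without Arabic or Latin letters get an empty label


samples = ['تفاحة', '(1990) تفاحة', 'ر. تفاحة', 'Red Appel',
           'R. Appel', 'تفاحة، Red', 'Red تفاحة']
labels = [classify_script(s) for s in samples]
print(labels)
```

With the DataFrame from the code above, this could be applied as `df['new_column'] = df['AR'].apply(classify_script)`.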
QuantStats
  • Thanks for your reply. How do I read the data from an Excel file, like this: data = 'words.xlsx'? – Charax Nov 09 '19 at 18:17
  • @Charax You can create the df directly using `df=pd.read_excel('words.xlsx')` or `df=pd.read_excel('words.xlsx',header=None)`, depending on the structure of your Excel file. – QuantStats Nov 09 '19 at 18:27
  • @Charax Just `df=pd.read_excel('words.xlsx')` or `df=pd.read_excel('words.xlsx',header=None)`. Don't do `data=pd.read_excel('words.xlsx')`. You don't need `df = pd.DataFrame(data)` either. `data` was created only to demonstrate an example; it isn't needed in your case because you will be operating directly on the `df`. – QuantStats Nov 09 '19 at 18:37
  • @Charax Call print(df.head()) for me to see. You're getting the error because you don't have 'AR' as one of the columns. – QuantStats Nov 09 '19 at 18:46
  • @QuantStats It seems unnecessarily complicated to write `.apply(lambda t: myfunc(t), ...)` instead of just `.apply(myfunc, ...)`. – lenz Nov 10 '19 at 12:45
  • @lenz Edited. You're right, didn't need to complicate that. – QuantStats Nov 10 '19 at 16:28
0

I think you can use the TextBlob package to define your new column. First install TextBlob; then your code will look like this:

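from textblob import TextBlob

# Note: detect_language() queries an online translation service and is
# deprecated in newer TextBlob releases, so this may not run everywhere.
def detect_language(text):
    diff_lang = []

    # detect the language of each whitespace-separated word
    for word in text.split():
        diff_lang.append(TextBlob(word).detect_language())

    different_language_count = len(set(diff_lang))

    # more than one distinct language means the row is mixed
    if different_language_count > 1:
        return "enar"
    else:
        return diff_lang[0]

df['new column'] = df['AR'].apply(detect_language)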
yasi