This might sound like a difficult question that requires several sophisticated operations, but please hear me out. I would really appreciate it if anyone could help me with this. I have df1 with names and country names (countries are repeated), and a list of keywords to be translated and appended to each name separately. In my real df, I have more than a thousand elements in the 'Name' column, and the countries are duplicated. I have about six keywords to translate. The df and list below are samples. Thanks!!

l=['spring','summer','fall','winter']

df1

'Name'  'Country'
Tom      United States
Sam      France
Tim      China
Andrew   Japan
Bess     Turkey
Sara     Romania

My goal is to create a df2 that looks like this:

'New Column'
Tom spring
Tom summer
Tom autumn
Tom winter
Sam printemps
Sam été
Sam l'automne
Sam hiver
.
.
.
Sara primăvară
Sara vară
Sara toamnă
Sara iarnă

Steps to consider:

  1. Detect the language to translate into by using the 'Country' column to get the Google Translate language code
  2. Translate the keywords into designated language
  3. One by one, concat the translated keywords at the end of the name separated by a space
  4. Put the outputted strings (Name + translated keyword) into one column as seen in df2
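
The four steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `lang_codes` and the `translations` lookup table are hypothetical stand-ins so that the example runs offline. In practice, step 2 would call a translation service instead.

```python
import pandas as pd

keywords = ["spring", "summer"]

df1 = pd.DataFrame({"Name": ["Tom", "Sam"],
                    "Country": ["United States", "France"]})

# Assumption: a hand-written country -> language code mapping (step 1).
lang_codes = {"United States": "en", "France": "fr"}

# Assumption: a tiny offline lookup standing in for a real translation call (step 2).
translations = {
    ("spring", "fr"): "printemps",
    ("summer", "fr"): "été",
}

def translate(word, lang):
    # English keywords are returned unchanged; others come from the lookup table.
    return word if lang == "en" else translations[(word, lang)]

rows = []
for name, country in zip(df1["Name"], df1["Country"]):
    lang = lang_codes[country]  # step 1: country -> language code
    for word in keywords:
        # steps 2 and 3: translate the keyword and append it to the name
        rows.append(f"{name} {translate(word, lang)}")

# step 4: collect all strings into a single column
df2 = pd.DataFrame({"New Column": rows})
print(df2)
```

Collecting the strings in a plain list and building the DataFrame once at the end avoids growing a DataFrame row by row.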

Thank you for taking the time to read through my question. I would very much appreciate any help!

marc_s
Matthias Gallagher

3 Answers

1

Building on 13f23f3f's answer: I think googletrans (free) and the language dictionary are a good start.

To start off, the langCodes dictionary can be expanded to support multiple languages for one country. You can't just define "Switzerland":"French" and "Switzerland":"German", because a dictionary key can only hold one value; instead, you can use a list of all possible languages. For example:

langCodes = {
"United States":["en"],
"France":["fr"], 
"Romania":["ro"],
"Switzerland":["de", "fr"] 
}

Here langCodes is a dictionary that holds a list of all possible languages for each country. With that in place, here is a full example using pandas and googletrans:

import pandas as pd 
from googletrans import Translator
translator = Translator() #init translator  

# proxies: uncomment to use proxies (for a large number of translations)
# if these do not work you might need to find other HTTP proxies online
#proxiesArray = [{'http':"134.122.19.151:3128"},{'http':"68.183.115.230:8080"},{'http':'104.129.196.153:10605'},{'http':'35.230.21.108:80'}]


#word list 
l=['spring','summer','fall','winter']

#langCodes More can be added by looking at the googletrans documentation
langCodes = {
"United States":["en"],
"France":["fr"],
"China":["zh-cn"],
"Japan":["ja"],
"Turkey":["tr"],
"Romania":["ro"],
"Switzerland":["de", "fr"] 
}

#df1 names and countries using pandas 
df1 = pd.DataFrame([["Tom","United States"],
["Sam","France"],
["Tim","China"],
["Andrew","Japan"],
["Bess","Turkey"],
["Sara","Romania"],
["Jeff","Switzerland"]],
columns=["Name","Country"])  

#collect the output rows in a list (DataFrame.append was removed in pandas 2.0)
rows = []

#iterate through the rows of df1
for idx, row in df1.iterrows():
    #iterate through the possible languages
    for lang in langCodes[row['Country']]:
        #iterate through the possible words
        for word in l:
            #translate the word using googletrans
            getTrans = translator.translate(word, dest=lang).text
            #proxies to use comment the line above and uncomment the two lines below
            #proxyIdex = idx % len(proxiesArray)
            #getTrans = translator.translate(word, proxy = proxiesArray[proxyIdex],dest=lang).text
            #append output to the list of rows
            rows.append(row['Name']+" "+getTrans)

#df2 build the new column from the collected rows
df2 = pd.DataFrame({"New Column": rows})

print(df2)

Sample output:

        New Column
0       Tom spring
1       Tom summer
2         Tom fall
3       Tom winter
4    Sam printemps
5          Sam été
6       Sam tomber
7        Sam hiver
8           Tim 弹簧
9           Tim 夏季
10          Tim 秋季
11          Tim 冬季
12        Andrew 春
13        Andrew 夏
14        Andrew 秋
15        Andrew 冬
16      Bess bahar
17        Bess yaz
18   Bess sonbahar
19        Bess kış
20  Sara primăvară
21       Sara vară
22     Sara toamna
23      Sara iarnă
24   Jeff Frühling
25    Jeff Sommer-
26       Jeff fall
27     Jeff winter
28  Jeff printemps
29        Jeff été
30     Jeff tomber
31      Jeff hiver

As you can see, "Jeff" has both German and French responses. Additionally, if the input list is very large, you can consider using hyper, as it can speed up translation according to the googletrans documentation.

Update: I added proxies to the answer for large translations. To use them, uncomment the relevant lines in the code. Beware that using proxies slows down the translations, but when I tested it with 6 words in l and 1500 entries in df1 it completed without error. Adding more proxies to proxiesArray should increase the translation capacity.
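
With six keywords and more than a thousand names, most of those requests are repeats: the translation of a keyword depends only on the target language, not on the name. One option, sketched below, is to translate each (word, language) pair once and reuse the result. `translate_fn` is a hypothetical stand-in for a call such as `translator.translate(word, dest=lang).text`, stubbed here so that the example runs offline and counts its calls.

```python
from functools import lru_cache

calls = []  # records each real lookup so the saving is visible

def translate_fn(word, lang):
    # Hypothetical stand-in for a network call such as
    # translator.translate(word, dest=lang).text
    calls.append((word, lang))
    return f"{word}@{lang}"

@lru_cache(maxsize=None)
def cached_translate(word, lang):
    # Each distinct (word, lang) pair reaches translate_fn only once.
    return translate_fn(word, lang)

words = ["spring", "summer", "fall", "winter"]
# Assumption: names already paired with a language code.
people = [("Tom", "en"), ("Sam", "fr"), ("Jeff", "fr"), ("Ann", "fr")]

rows = [f"{name} {cached_translate(word, lang)}"
        for name, lang in people
        for word in words]

print(len(rows), "rows built with", len(calls), "translation calls")
```

With this arrangement the number of requests grows with the number of distinct languages times the number of keywords, rather than with the number of names.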

mcmanetta
  • I tried with this and it keeps throwing me 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'. I read that it is due to Google API limit from https://stackoverflow.com/questions/49497391/googletrans-api-error-expecting-value-line-1-column-1-char-0. But I tried re-initialising the translator in the three for loops. It still does not work. Any insights on solving this error? – Matthias Gallagher Jul 16 '20 at 03:57
  • @MatthiasGallagher how many words are you trying to translate? I think the error could be caused by a large number of requests. To solve this (if the data is not sensitive) you can split the data and then run it through a proxy, like "translator = Translator(proxies={'http':"134.122.19.151:3128"})". Random proxies can be found at http://free-proxy.cz/en/proxylist/region/California/all/ping. – mcmanetta Jul 16 '20 at 04:44
  • I have six elements in the word list and more than a thousand elements in the name column – Matthias Gallagher Jul 16 '20 at 05:19
  • @marc_s I have six elements in the word list and more than a thousand elements in the name column – Matthias Gallagher Jul 16 '20 at 06:51
0

Here's one way you can do it; I hope this gives you an idea:

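# add the keyword list to each row
df['kw'] = [l for _ in range(len(df))]

# expand the list so that each keyword gets its own row
df = df.explode('kw')

# call the translation API on each keyword
# (google_api.search is a placeholder, not a real function; substitute your own translation call)
df['trans'] = df.apply(lambda x: google_api.search(text = x['kw'], country = x['Country']), axis=1)

# join the name and the translated keyword, separated by a space
df['new_columns'] = df[['Name', 'trans']].agg(' '.join, axis=1)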
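
A runnable variant of the same explode idea, as a sketch only: the keyword list is attached to every row, `explode` turns it into one row per (name, keyword), and a translation function is applied to each keyword. `fake_translate`, `lookup` and `lang_codes` are hypothetical placeholders for the real API call and country mapping, so the example works offline.

```python
import pandas as pd

l = ["spring", "summer"]
df = pd.DataFrame({"Name": ["Tom", "Sam"],
                   "Country": ["United States", "France"]})

# Assumption: hand-written country -> language code mapping.
lang_codes = {"United States": "en", "France": "fr"}

# Assumption: offline lookup standing in for a translation API.
lookup = {("spring", "fr"): "printemps", ("summer", "fr"): "été"}

def fake_translate(word, lang):
    # Fall back to the untranslated word when no entry exists.
    return lookup.get((word, lang), word)

# Attach the full keyword list to every row, then expand to one row per keyword.
df["kw"] = pd.Series([l] * len(df), index=df.index)
out = df.explode("kw").reset_index(drop=True)

# Map each country to its language code and translate keyword by keyword.
out["lang"] = out["Country"].map(lang_codes)
out["trans"] = [fake_translate(w, g) for w, g in zip(out["kw"], out["lang"])]

# Join name and translated keyword with a space.
out["New Column"] = out["Name"] + " " + out["trans"]
print(out["New Column"].tolist())
```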
YOLO
0

To start off, the translation you want can be done with googletrans, a Python package that provides access to Google Translate.

For example:

from googletrans import Translator
translator = Translator()
translation = translator.translate('SomeWord', dest='ja')

However, you run into ambiguity, since many countries have multiple languages. Switzerland, for example, could be French, German, etc. So you could create a dictionary that maps countries to language codes:

langCodes = {
"United States":"en",
"France":"fr", 
"Romania":"ro" 
}

At this point you could iterate through the data set using the langCodes dictionary to convert to language codes and googletrans to translate.
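
One detail worth checking before iterating: a Python dictionary keeps only one value per key, so writing the same country twice silently drops the earlier entry. The small sketch below shows that behaviour and one possible alternative, a list of codes per country. The Switzerland entries and the name `lang_codes` are illustrative only.

```python
# Repeating a key in a dict literal keeps only the last value.
dupes = {"Switzerland": "fr", "Switzerland": "de"}
print(dupes)

# Storing a list of codes per country keeps every language.
lang_codes = {"France": ["fr"], "Switzerland": ["de", "fr"]}

# Flatten into (country, code) pairs, one per language to translate into.
pairs = [(country, code)
         for country, codes in lang_codes.items()
         for code in codes]
print(pairs)
```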

  • This sounds like a great start! Can you elaborate more on the iteration part? Sorry I am quite new to python and coding. If I create three languages for one country, say like "Switzerland":"French" and "Switzerland":"German", how will it turn out? – Matthias Gallagher Jul 15 '20 at 04:37