
I am trying to translate words from a Pandas dataframe column of ca. 200000 rows in length. It looks like this:

    df =| review  | rating | 
        | love it | 5      |
        | hate it | 1      |
        | its ok  | 3      |
        | great   | 4      | 

I am trying to translate this column into another language using googletrans. I have seen solutions that use df.apply to call the translation function on each row, but that is painfully slow in my case (roughly 16 hours to translate the whole column).

However, googletrans does support batch translation, where it takes a list of strings as an argument instead of a single string.

I have been looking for a solution that takes advantage of this, and my code looks like this:

    from googletrans import Translator

    translator = Translator()
    list1 = df.review.tolist()
    translated = []
    # Send the reviews in batches of 50 strings per translate() call
    for i in range(0, len(list1), 50):
        batch = translator.translate(list1[i:i+50], src='en', dest='id')
        translated.extend(x.text for x in batch)
    df['translated_review'] = translated  # add back to df

But it is still just as slow. Could anyone shed some light on how to optimise this further?

mac13k
asdj1234

1 Answer


Perhaps you could try reshaping the review column as a numpy.array instead, so that each row of the array is one batch, e.g.:

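    translated = []
    # Reshape into rows of 50 reviews each; every row is sent as one batch
    for row in df.review.values.reshape((-1, 50)):
        # Pass a plain list rather than a NumPy array to translate()
        results = translator.translate(row.tolist(), src='en', dest='id')
        translated.extend(r.text for r in results)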

Note that the length of the df.review series must be divisible by 50 for the reshape method to work. If it is not, either choose another batch size or trim the series to a length that is a multiple of 50.
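
If trimming the series is not acceptable, one alternative to reshape is to split the array at multiples of the batch size and let the last batch hold the remainder. The following is only a sketch of that idea: split_into_batches is a hypothetical helper name, and the array of integers stands in for the real df.review.values.

```python
import numpy as np

def split_into_batches(values, batch_size):
    """Split a 1-D array into consecutive batches of at most batch_size items."""
    values = np.asarray(values)
    # Cut points at batch_size, 2 * batch_size, ...; the last batch keeps the remainder
    cut_points = np.arange(batch_size, len(values), batch_size)
    return np.split(values, cut_points)

# Stand-in data with the same length as the column discussed here (assumption)
batches = split_into_batches(np.arange(209729), 50)
print(len(batches), len(batches[0]), len(batches[-1]))  # 4195 50 29
```

Each element of batches can then be converted with .tolist() and passed to the translator in the same loop as above.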

A further improvement would be to parallelize the translations. For that you should look into parallel processing in Python, e.g. 1, 2.
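
As a rough illustration of the parallel approach, here is a minimal sketch using a thread pool from the standard library, which suits network-bound work. The names translate_batch and translate_parallel are hypothetical, and translate_batch is a placeholder that only upper-cases its input so the example runs offline. Whether a single Translator object can safely be shared between threads, and how many concurrent requests the service tolerates, are assumptions you would need to check.

```python
from concurrent.futures import ThreadPoolExecutor

def translate_batch(batch):
    # Hypothetical placeholder so the sketch runs without network access.
    # In real use, swap in something like:
    #   [t.text for t in translator.translate(batch, src='en', dest='id')]
    return [text.upper() for text in batch]

def translate_parallel(texts, batch_size=50, max_workers=4):
    # Slice the input into batches; the last one may be shorter
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input batches
        results = pool.map(translate_batch, batches)
        return [text for batch in results for text in batch]

reviews = ["love it", "hate it", "its ok", "great"]
print(translate_parallel(reviews, batch_size=2, max_workers=2))
# ['LOVE IT', 'HATE IT', 'ITS OK', 'GREAT']
```

Because the output order matches the input order, the returned list can be assigned straight back to a new dataframe column.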

mac13k
  • got a ValueError: cannot reshape array of size 209729 into shape (50) – asdj1234 Jul 23 '20 at 20:45
  • What an interesting number: 209729 = 13*13*17*73, so you can use one of the prime factors (e.g. 17 or 73) or a product of them (e.g. 13*13 = 169) for the number of columns instead of 50. – mac13k Jul 24 '20 at 08:39