import time
start = time.time()
import pandas as pd
from deep_translator import GoogleTranslator
    
data = pd.read_excel(r"latestdata.xlsx")
translatedata = data['column'].fillna('novalue')

translated = []  # avoid shadowing the built-in list
for i in translatedata:
    finaldata = GoogleTranslator(source='auto', target='english').translate(i)
    print(finaldata)
    translated.append(finaldata)

df = pd.DataFrame(translated, columns=['Translated_values'])
df.to_csv(r"jobdone.csv", sep=';')
    
end = time.time()

print(f"Runtime of the program is {end - start}")

I have a dataset of 220k points and I am trying to translate one column of it. At first I tried the pool method for a parallel program, but got an error that I cannot access the API several times at once. My question is whether there is another way to improve the performance of the code I have right now.

# 4066.826668739319    with just 10000 rows all together.
# 3809.4675991535187   computation time when I run in 2 batches of 5000

1 Answer


Q :
" ... is ( there ) other way to improve performance of code ...? "

A :
Yes, there are a few ways,
yet do not expect anything magical, as you have already reported that the API-provider throttles/blocks somewhat higher levels of concurrent API-calls from being served.

There still might be some positive effects from latency-masking tricks of a just-[CONCURRENT] orchestration of several API-calls, as the End-to-End latencies are principally "long" ( going many times across the over-the-"network"-horizons ) and there is also some remarkable server-side TAT-latency on the translation-matching engines.
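
A just-[CONCURRENT] latency-masking sketch may look as minimal as the one below - the ThreadPoolExecutor, the MAX_WORKERS value and the translate_one helper are illustrative assumptions, not part of the original code, and the thread-count has to stay within whatever the API-provider actually tolerates :

import pandas as pd
from   concurrent.futures import ThreadPoolExecutor
from   deep_translator    import GoogleTranslator as gXLTe

MAX_WORKERS = 4                                    # an assumed, API-policy-friendly thread-count

def translate_one( aText ):                        # an illustrative helper, not from the original code
    return gXLTe( source = 'auto',
                  target = 'english'
                  ).translate( aText )

xltDF = pd.read_excel( r"latestdata.xlsx" )['column'].fillna( 'novalue' )

with ThreadPoolExecutor( max_workers = MAX_WORKERS ) as aPool:        # overlap the network-latencies
    resDF = pd.Series( list( aPool.map( translate_one, xltDF ) ) )    # .map() keeps the row-order

resDF.to_csv( r"jobdone.csv", sep = ';' )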

Details matter, a lot...

A performance boosting code-template to start with
( avoiding 220k+ repeated local-side overheads' add-on costs ) :

import time
import pandas as pd
from   deep_translator import GoogleTranslator as gXLTe
    
xltDF = pd.read_excel( r"latestdata.xlsx" )['column'].fillna( 'novalue' )
resDF = xltDF.copy( deep = True )

PROC_ns_START = time.perf_counter_ns()
#________________________________________________________ CRITICAL SECTION: start
for i in range( len( xltDF ) ):
    resDF.iloc[i] = gXLTe( source = 'auto',
                           target = 'english'
                           ).translate( xltDF.iloc[i] )

#________________________________________________________ CRITICAL SECTION: end
PROC_ns_END = time.perf_counter_ns()

resDF.to_csv( r"jobdone.csv",
              sep = ';'
              )

print( f"Runtime was {0:} [ns]".format( PROC_ns_END - PROC_ns_START ) )

Tips for performance boosting :

  • if the Google API-policy permits, we may increase the thread-count that participates in the CRITICAL SECTION,
  • as the Python-interpreter threads live "inside" the same address-space and still remain GIL-lock MUTEX-blocked, we may let all just-[CONCURRENT] accesses operate on the same DataFrame-objects, best using non-overlapping, separate (thread-private) block-iterators over disjunct halves ( for a pair of threads ), over disjunct thirds ( for 3 threads ) etc. ( see the launcher sketch after this list ),
  • as the Google API-policy is limiting attempts at overly concurrent access to the API-service, you shall build in some, even naive, robustness :
def thread_hosted_blockCRAWLer( i_start, i_end ):
    for i in range( i_start, i_end ):
        while True:
            try:
                resDF.iloc[i] = gXLTe( source = 'auto',
                                       target = 'english'
                                       ).translate( xltDF.iloc[i] )
                # SUCCEEDED
                break
            except Exception:
                # FAILED
                print( "EXC: _blockCRAWLer() on index ", i )
                time.sleep( ... )
                # be careful here, not to get on the API-provider's BLACK-LIST
                continue
  • if more time-related details per thread are needed, you may reuse the same ns-resolution timing approach per thread-hosted block
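
A minimal launcher sketch for the disjunct-blocks idea above, assuming the xltDF / resDF / thread_hosted_blockCRAWLer names from the template - the split into just two halves and the two-thread count are illustrative assumptions, mind the API-policy before adding more threads :

import threading

N_ROWS = len( xltDF )
HALF   = N_ROWS // 2
                                                   # two disjunct, non-overlapping blocks
pairOfTHREADs = [ threading.Thread( target = thread_hosted_blockCRAWLer,
                                    args   = (    0, HALF   ) ),
                  threading.Thread( target = thread_hosted_blockCRAWLer,
                                    args   = ( HALF, N_ROWS ) ),
                  ]
for aThread in pairOfTHREADs: aThread.start()      # launch both block-crawlers
for aThread in pairOfTHREADs: aThread.join()       # wait for both blocks to finish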

Do not hesitate to go tuning & tweaking - and anyway, keep us posted on how fast you managed to get, that's fair, isn't it?
