
I'm currently trying to normalize a DataFrame (~600k rows) with prices (pricevalue) in different currencies (pricecurrency) so that every row has its price in EUR.

I'd like to convert them using the daily rate taken from a date column.

My current "solution" (using the CurrencyConverter package found on PyPI) looks like this:

from currency_converter import CurrencyConverter

c = CurrencyConverter(fallback_on_missing_rate=True,
                      fallback_on_missing_rate_method="last_known")

def convert_currency(row):
    return c.convert(row["pricevalue"], row["pricecurrency"], row["date"])

df["converted_eur"] = df.apply(lambda x: convert_currency(x), axis=1)

However, this solution is taking forever to run.

Is there a faster way to accomplish that? Any help is appreciated :)

PStein
  • Does this help: https://stackoverflow.com/a/51669468/6660373? – Pygirl May 06 '20 at 13:36
  • Do you have the YOURCOIN>EUR daily price in another sheet, or is this CurrencyConverter package getting it for you? If so, it might explain the long runtime. – mrbTT May 06 '20 at 13:38
  • Are these current prices or historic prices? What is the range of dates that a price could be in? – Dave May 06 '20 at 14:06
  • @mrbTT The CurrencyConverter package is getting the prices for me – PStein May 06 '20 at 14:14
  • @Pygirl Will check out the post, thank you. – PStein May 06 '20 at 14:14
  • @Dave The prices are historic, ranging from ~2008-2018, so I need some form of lookup of the exchange rate, which is provided by the CurrencyConverter package mentioned. – PStein May 06 '20 at 14:15
  • @Dave, roughly 600,000 rows as mentioned in the post. – PStein May 06 '20 at 14:20
  • The issue might be that CurrencyConverter needs to get the values online, and its API has a limited number of queries or is slow to fetch results. – mrbTT May 06 '20 at 18:44

1 Answer


It sounds strange to say this, but unfortunately you're not doing anything wrong!

The currency conversion code is doing what you need it to do, and not much else, so there's nothing you can quickly fix to gain performance. You do have a redundant lambda where a plain function reference would do, but that won't make much of a difference:

i.e.

df["converted_eur"] = df.apply(lambda x: convert_currency(x), axis=1)

should be

df["converted_eur"] = df.apply(convert_currency, axis=1)

The first thing to do is to understand how long this processing will actually take by adding a progress bar:

from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

df["converted_eur"] = df.progress_apply(convert_currency, axis=1)

Once you know how long the job will actually take, try out these, in order:

  1. Live with it.
  2. Single instance parallelization, with something like pandarallel
  3. Multi instance parallelization, with something like Dask
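Before parallelizing, one further lever may be worth a try: if many rows share the same (pricecurrency, date) pair, the rate lookup can be deduplicated so the converter is queried once per unique pair rather than once per row, and the multiplication is then fully vectorized. A sketch, using a hypothetical fixed-rate table in place of the real CurrencyConverter call:

```python
import pandas as pd

# Hypothetical rate table standing in for CurrencyConverter; in the real
# code each rate would come from c.convert(1, currency, date).
RATES = {
    ("USD", "2016-05-06"): 0.88,
    ("GBP", "2016-05-06"): 1.27,
}

def rate_to_eur(currency, date):
    # Stand-in for the per-pair converter lookup.
    return RATES[(currency, date)]

df = pd.DataFrame({
    "pricevalue": [100.0, 200.0, 50.0],
    "pricecurrency": ["USD", "GBP", "USD"],
    "date": ["2016-05-06", "2016-05-06", "2016-05-06"],
})

# Look up each unique (currency, date) pair once instead of once per row...
pairs = df[["pricecurrency", "date"]].drop_duplicates()
pairs["rate"] = [rate_to_eur(c, d)
                 for c, d in zip(pairs["pricecurrency"], pairs["date"])]

# ...then broadcast the rates back with a merge and multiply vectorized.
df = df.merge(pairs, on=["pricecurrency", "date"], how="left")
df["converted_eur"] = df["pricevalue"] * df["rate"]
```

With 600k rows spread over ~10 years of daily dates and a handful of currencies, the number of unique pairs should be far smaller than the number of rows, so this can cut the number of converter calls substantially.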
Dave
  • Thanks for the detailed answer and for pointing out further reading! Will definitely take a look at options 2 and 3. tqdm is, by the way, a neat little tool I'll add to my toolbox! :) – PStein May 06 '20 at 21:06
  • Happy to help! Let us know what you go with in the end. – Dave May 06 '20 at 21:08