I have a dataframe of 6K items, each described by 70 fields, so the dataframe is around 420K rows long. I then apply my function like this:

    df_dirty[['basic_score', 'additional_score']] = df_dirty.apply(compare.compare, axis=1)

The compare function takes a row from df_dirty, reads an ID from that row, and depending on that ID selects two other cells from the row and compares them. The comparison may be as simple as

    if cell1 == cell2:
        return True
    else:
        return False

or a more involved calculation that takes the values of those cells and checks whether their ratio falls within some range. Overall, the function I apply to my dataframe performs several more actions, so it is very time-consuming for large datasets of complex data (not only clean numbers, but mixtures of numbers and text, etc.).
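
To give a rough idea, here is a simplified sketch of compare.compare; the column names (id, value_a, value_b) and the ratio bounds are placeholders, not my real fields:

    import pandas as pd

    def compare(row):
        # Placeholder logic: the ID decides which rule applies to the
        # two cells of the row being compared.
        if str(row['id']).startswith('A'):
            # Simple case: plain equality of two cells.
            basic = row['value_a'] == row['value_b']
            additional = False
        else:
            basic = row['value_a'] == row['value_b']
            # Harder case: is the ratio of the two values in range?
            additional = (
                row['value_b'] != 0
                and 0.9 <= row['value_a'] / row['value_b'] <= 1.1
            )
        # Returning a Series is what makes the two-column
        # assignment above work.
        return pd.Series([basic, additional],
                         index=['basic_score', 'additional_score'])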

I was wondering if there are any faster ways to do this than simply applying a function?

I have some ideas about what I could do with this:

Put everything on a server and run all the calculations overnight, so it would be faster to just look up an already calculated result. I also thought it might be faster if I wrote my compare function in C.
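
For the trivial equality case I know I could vectorize the comparison over whole columns instead of calling a Python function per row, something like this (again with placeholder column names):

    # Whole-column operations instead of a per-row Python call;
    # value_a/value_b stand in for my real fields in df_dirty.
    df_dirty['basic_score'] = df_dirty['value_a'] == df_dirty['value_b']

    # The ratio-in-range check can be written the same way; division by
    # zero gives inf (or NaN for 0/0), which between() treats as False.
    ratio = df_dirty['value_a'] / df_dirty['value_b']
    df_dirty['additional_score'] = ratio.between(0.9, 1.1)

But since my real data mixes numbers and text, I am not sure how far plain vectorization gets me. What are my other options?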

milka1117
  • Could you give us the code of the compare function as well, please? – Mayeul sgc Jul 31 '19 at 09:37
  • Take a look at: https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column – anky Jul 31 '19 at 09:45
  • Use pandas built-in vectorized methods instead of `apply` with custom functions. See [here](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6) and [here](https://realpython.com/fast-flexible-pandas/) for detailed explanation. – Xukrao Jul 31 '19 at 09:46

0 Answers