I have a dataframe consisting of 6K items, each described by 70 fields, so the dataframe is around 420K rows long. I apply my function to it like this:
df_dirty[['basic_score', 'additional_score']] = df_dirty.apply(compare.compare, axis=1)
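For context, here is a simplified stand-in for the function (the column names and the selection rule are placeholders, not my real logic); the important part is that it returns a two-element Series so the assignment above can expand the result into the two new columns:

import pandas as pd

def compare(row):
    # pick the two cells to compare based on the row's ID
    # (placeholder rule; the real selection is more involved)
    if row['id'] % 2 == 0:
        cell1, cell2 = row['field_a'], row['field_b']
    else:
        cell1, cell2 = row['field_c'], row['field_d']
    basic_score = cell1 == cell2
    additional_score = abs(cell1 - cell2)  # stand-in for the heavier scoring
    # returning a Series lets pandas expand it into the two assigned columns
    return pd.Series([basic_score, additional_score])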
The compare function takes a row from df_dirty, reads an ID from it, uses that ID to pick two other cells from the same row, and performs a comparison of those two cells. The comparison may be as simple as
if cell1 == cell2:
    return True
else:
    return False
or a more involved calculation that takes the values of those cells and checks whether their ratio falls within some range. Overall, the function I apply performs several more actions on top of this, so it is very time-consuming on large datasets of complex data (not only clean numbers, but mixes of numbers and text, etc.).
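For the simplest cases the checks could in principle be written column-wise; a sketch with made-up column names, data, and bounds:

import pandas as pd

df = pd.DataFrame({'cell1': [1.0, 2.0, 3.0], 'cell2': [1.0, 2.5, 9.0]})

# the simple case: exact equality of the two cells
basic = df['cell1'] == df['cell2']

# the harder case: is the ratio of the two cells inside some range?
in_range = (df['cell1'] / df['cell2']).between(0.8, 1.25)

But my real function chooses the cells per row and mixes numbers with text, so I don't see how to express all of it that way.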
I was wondering if there are any faster ways to do this than simply applying a function?
I have some ideas about what I could do:

- Put everything on a server and run all the calculations overnight, so that it would be faster to just ask for an already computed result.
- Rewrite my compare function in C, which might be faster (see the sketch after this list for a lighter-weight variant of this idea).

What are my other options?
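On the C idea: before writing actual C, something like numba might give a comparable speedup for the numeric parts with much less glue code. A minimal sketch, assuming the heavy part is a numeric ratio check (the function name, columns, and bounds are all made up):

import numpy as np
from numba import njit

@njit
def ratio_in_range(cell1, cell2, low, high):
    # element-wise check that cell1/cell2 falls within [low, high],
    # skipping rows where the denominator is zero
    out = np.zeros(cell1.shape[0], dtype=np.bool_)
    for i in range(cell1.shape[0]):
        if cell2[i] != 0:
            r = cell1[i] / cell2[i]
            out[i] = low <= r <= high
    return out

# usage: pass plain numpy arrays extracted from the dataframe
# flags = ratio_in_range(df['cell1'].to_numpy(), df['cell2'].to_numpy(), 0.8, 1.25)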