
A while ago I interviewed for a Data Scientist role. Strangely, without asking about Machine Learning, Data Science, or even Statistics, I was given a small task: join two pandas dataframes and compare various methods for doing so. I was not given any criteria for what the expectation was. I provided multiple solutions. Surprisingly, I was told afterwards that none of my solutions met the performance benchmarks of other solutions they have for this task! Obviously I asked for feedback, or what other method(s) they use that outperform my solutions, but no answer so far. I consider myself an intermediate Python programmer; I am certainly not aware of many tricks or best practices, and I have not paid much attention to performance so far, unless something was obviously slow. Since the interview, this has got me thinking about what other ways there are to accomplish this task as fast as possible.

The problem:

import numpy as np
import pandas as pd

# Randomly generated historical data about how many megabytes were downloaded
# from the Internet. "HoD" is the Hour of the Day.
hist_df = pd.DataFrame(columns=['HoD', 'Volume'])
hist_df['HoD'] = np.random.randint(0, 24, 365 * 24)
hist_df['Volume'] = np.random.uniform(1, 1000, 365 * 24)

# Tariffs based on the hour of the day
tariffs_df = pd.DataFrame({
    'Time range': ['00:00 to 09:00', '09:00 to 18:00', '18:00 to 00:00'],
    'cost': [10, 14, 22]
})

Task: Return the historical dataframe with an additional column "cost" that shows how much money was spent for every hour in the historical data. Basically, the tariff dataframe needs to be merged into the historical data.

My solutions: Here is the gist containing the four methods I provided. From slowest to fastest, they are: i) simple iterrows (slowest), ii) pandas apply, iii) numpy vectorize, and iv) pandas binning plus merge (fastest). The results are (in seconds):

{'naive_iterrows': 5.810565948486328,
 'pandas_apply': 0.6743350028991699,
 'numpy_vec': 0.6381142139434814,
 'pandas_bin-merge': 0.009788990020751953}
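
For context, since the gist itself is not reproduced here, a minimal sketch of the binning-plus-merge idea (method iv); the exact cut points and the `bin` column handling below are illustrative assumptions, not necessarily the gist's code:

# Sketch of method iv: bin each hour into its tariff window, then merge.
# Cut points assume the three tariff windows defined above (0-8, 9-17, 18-23).
hist_df['bin'] = pd.cut(hist_df['HoD'],
                        bins=[-1, 8, 17, 23],
                        labels=list(tariffs_df['Time range'])).astype(str)

# Left-merge the tariffs onto the binned historical data.
merged_df = pd.merge(hist_df, tariffs_df, how='left',
                     left_on='bin', right_on='Time range')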


Question: What faster methods are out there to achieve this? As I mentioned in the first paragraph, they were not happy with the performance of my solutions.

P.S.: Although this matter is very subjective, I find it bizarre that they post a Data Scientist role (with many machine-learning requirements) and yet reject an applicant over this. I am super glad I came to know their job requirements and expectations beforehand. Still, I love learning more about pandas and Python best practices, so for this particular case, let me know if there are any other approaches!

TwinPenguins
  • I understand what you are describing, but what is your question in the end exactly? – Erfan Apr 23 '19 at 09:06
  • I have a curiosity question: did you have to do it live in front of your interviewer, or did you have time to think about it at home? – ggrelet Apr 23 '19 at 09:13
  • Well, I thought I was clear when I said "Surprisingly I was told afterwards that none of my solutions met the performance benchmarks of other solutions they have for this task!". They said there are faster ways to achieve this, and I am wondering what other methods are out there! I will change the question so that this is clear. – TwinPenguins Apr 23 '19 at 09:14
  • 1
    @ggrelet It was not live. I was told though not to spend more than 2 hr on it. The task was a bit more than this (e.g. wiring a proper Python class that does this or that, one of the method of the class was comparing various pandas). I ended up providing my solutions in 2 and a half hr actually! – TwinPenguins Apr 23 '19 at 09:26
  • 1
    I see the point. While your method #4 seems to work well, I think you can gain more performance by efficiently using pandas `join` ? At least this is what I can gather from the post here: https://stackoverflow.com/questions/40860457/improve-pandas-merge-performance - trying it myself though doesn't yield in any improvement whatsoever, since there is still the binning step which negates the time advantage. – Siddharth Apr 23 '19 at 09:32
  • @Siddharth I was not aware of the join by setting the index. Excellent. – TwinPenguins Apr 23 '19 at 10:23

1 Answer


You can change the last method to a join, as suggested by Siddharth. Suppose your DataFrame is much larger:

hist_df = pd.DataFrame(columns=['HoD', 'Volume'])
hist_df['HoD'] = np.random.randint(0, 10000, 365 * 10000)
hist_df['Volume'] = np.random.uniform(1, 10000, 365 * 10000)

Let's compare:

%timeit merged_df = pd.merge(hist_df, tariffs_df, how='left', left_on='bin', right_on='Time range')

1 loop, best of 3: 740 ms per loop


hist = hist_df.set_index('bin')
tariffs = tariffs_df.set_index('Time range')

%timeit hist_df.set_index('bin')
%timeit tariffs_df.set_index('Time range')
%timeit merged_df = hist.join(tariffs)

10 loops, best of 3: 20.1 ms per loop
1000 loops, best of 3: 449 µs per loop
100 loops, best of 3: 3.59 ms per loop
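
For reference, `merge` can also operate on the indexes directly (a point raised in the comments below); a minimal equivalent sketch:

# Equivalent left join expressed with merge's index flags instead of join().
merged_df = pd.merge(hist, tariffs, how='left',
                     left_index=True, right_index=True)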

If other packages are allowed, you can also use dask for parallelization.
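
A rough sketch of that idea (not benchmarked here; `npartitions=8` is an arbitrary choice worth tuning to your cores and data size):

import dask.dataframe as dd

# Partition the large historical frame; the small tariff frame stays in pandas.
hist_ddf = dd.from_pandas(hist_df, npartitions=8)

# The merge builds a lazy task graph; .compute() runs it in parallel.
merged_df = hist_ddf.merge(tariffs_df, how='left',
                           left_on='bin', right_on='Time range').compute()

For a lookup table this small, the parallelization overhead may well outweigh the gain; it pays off mainly when the historical frame no longer fits comfortably in memory.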

DmytroSytro
  • Thanks a lot for your answer. I was not aware of this kind of join. Why is joining on indexes so much faster? Please let me try it on my end before accepting your answer! – TwinPenguins Apr 23 '19 at 10:25
  • 1
    That's because `join` joins to index, and `merge` operates on columns by default. You can also merge by index by specifying parameter. Look at this answer: https://stackoverflow.com/questions/50970859/merging-dataframes-on-an-index-is-more-efficient-in-pandas – DmytroSytro Apr 23 '19 at 12:18
  • Excellent, thanks for the comment and for the link. I did not know that "a DataFrame's index is backed by a hash table"! I guess I'm learning every day. I only see the performance gain on large datasets, as you nicely mentioned; with the small dataset I was benchmarking, the gain was marginal. – TwinPenguins Apr 23 '19 at 13:55