0

I am currently dealing with 2 csv files where 1 contains 5 columns - year, month, day, hour, and Revenue and I want to transcribe the Revenue column to another csv file, which also has the year, month, day, and hour columns. For example, I want to add a 'Revenue' column on the second csv file and place the revenue value for 2020/05/26 17:00 to the Revenue column on the first csv with the same date and hour.

Example of the first csv

year, month, day, hour, revenue
2020,3,5,8,100
2020,3,5,9,100
2020,3,5,11,100

Example of the second csv

year, month, day, hour, revenue
2020,3,5,8,
2020,3,5,9,
2020,3,5,10,
2020,3,5,11,
2020,3,5,12,

The resulting csv I want to get

year, month, day, hour, revenue
2020,3,5,8,100
2020,3,5,9,100
2020,3,5,10,
2020,3,5,11,100
2020,3,5,12,

I checked Pandas Merging 101 but it was talking about when 1 column matches while I'm checking if all 4 columns match. Also, this will be a large csv of around 10,000 rows.

Thank you in advance.

asdfg
  • 57
  • 4

1 Answers1

1

The on parameter of merge can take a list of columns:

df3 = df1.merge(df2, on=['year', 'month', 'day', 'hour'], how='right')

The default behaviour of merge is to merge on the "intersection of the columns in both DataFrames" so in this particular case the default merge is equivelent:

df3 = df1.merge(df2, how='right')

If df2 has a column called revenue with no meaningful values drop that before the merge:

df3 = df1.merge(df2.drop(columns='revenue'), how='right')

df3:

   year  month  day  hour  revenue
0  2020      3    5     8    100.0
1  2020      3    5     9    100.0
2  2020      3    5    10      NaN
3  2020      3    5    11    100.0
4  2020      3    5    12      NaN

Write result to csv with to_csv:

df1.merge(df2, how='right').to_csv('out.csv', index=False)

out.csv:

year,month,day,hour,revenue
2020,3,5,8,100.0
2020,3,5,9,100.0
2020,3,5,10,
2020,3,5,11,100.0
2020,3,5,12,

Complete Working Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'year': [2020, 2020, 2020], 'month': [3, 3, 3], 'day': [5, 5, 5],
    'hour': [8, 9, 11], 'revenue': [100, 100, 100]
})

df2 = pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2020], 'month': [3, 3, 3, 3, 3],
    'day': [5, 5, 5, 5, 5], 'hour': [8, 9, 10, 11, 12],
    'revenue': [np.nan, np.nan, np.nan, np.nan, np.nan]
})

df1.merge(df2.drop(columns='revenue'), how='right') \
    .to_csv('out.csv', index=False)
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • Ohh I see! Thank you very much. – asdfg May 30 '21 at 22:33
  • With how the original data was presented, the second CSV had a column called 'revenue' as well, with all blank values. If this column is pulled into the data-frame, then the result ends up different, with separate 'revenue_x' and 'revenue_y' columns (what pandas will call them by default). One of those is just the 'revenue' from df1 and the other is 'revenue' from df2. This answer has the df2 just not having a revenue column at all, which resolves that. If df2 will always be null for revenue, then the read_csv would just have to not bring in that column. – scotscotmcc May 30 '21 at 22:38