
I'd like to add supplementary information from df2 to df1 by matching date data.

df1 is the main dataframe:

            x0      x1      x2      x3      x4      x5      ...  x10000  Date       
1           40      31.05   25.5    25.5    25.5    25      ...    33    2013-11-13
2           35      35.75   36.5    36.5    36.5    36.5    ...    29    2013-09-05
⋮           ⋮       ⋮        ⋮       ⋮       ⋮        ⋮               ⋮

df2 is the supplementary weather info I want to add to df1:

year month day  maxtemp mintemp rainfall    wind 
2013    1   1   26.2    20.2     0          32.4
2013    1   2   22.9    20.3     0          10
2013    1   3   24.8    18.4     0          28.8
2013    1   4   26.6    18.3     0          33.5
2013    1   5   28.3    20.9     0          33.4
2013    1   6   28      21.6     0          32.8
2013    1   7   27.5    21.4     0          26.8
2013    1   8   42.3    20.9     0          25.5
2013    1   9   25      21.1     0          20.9
2013    1   10  25.4    20.2     0          14
⋮       ⋮    ⋮   ⋮        ⋮        ⋮           ⋮

I need to add the preceding 100 days of maxtemp, mintemp, rainfall and wind data, extracted from df2, to the end of each row of df1 horizontally, by matching on the year, month and day of Date in df1. Date itself is the 100th day, and the other 99 days are the 99 days immediately before it.

Expected output:

     x0  x1    x2   x3   x4   x5   ... x10000 Date       max_t1 ... max_t100 min_t1 ... min_t100 rf1 ... rf100 w1 ... w100
1    40  31.05 25.5 25.5 25.5 25   ...  33    2013-01-01 26.2   ...          20.2   ...          0   ...       32.4 ...
2    35  35.75 36.5 36.5 36.5 36.5 ...  29    2013-01-03 24.8   ...          18.4   ...          0   ...       28.8 ...
⋮     ⋮   ⋮      ⋮    ⋮    ⋮     ⋮          ⋮

where

max_t1, ..., max_t100 represent the max temperature from day 1 to day 100 (the `Date` day);
min_t1, ..., min_t100 represent the min temperature from day 1 to day 100 (the `Date` day);
rf1, ..., rf100 represent the rainfall from day 1 to day 100 (the `Date` day);
w1, ..., w100 represent the wind from day 1 to day 100 (the `Date` day).

These are the newly added column names (so there will be 400 new columns in total).
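
For clarity, the 400 new column names I'm expecting could be generated like this (just to illustrate the naming scheme):

import itertools

# max_t1..max_t100, min_t1..min_t100, rf1..rf100, w1..w100 -> 400 names in total
new_cols = [prefix + str(day) for prefix, day in
            itertools.product(["max_t", "min_t", "rf", "w"], range(1, 101))]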


2 Answers


I suggest creating the new 400 columns in df2 first, and then using pandas.DataFrame.merge to merge it into df1.

Breaking it down into two problems:

PROBLEM 1: Calculate aggregated values for last x days

Answered here

Applying to your situation:

In[1]: import pandas as pd
       df2 = pd.DataFrame({"year": [2013, 2013, 2013, 2013, 2013],
                           "month": [1, 1, 1, 1, 1],
                           "day": [1, 2, 3, 4, 5],
                           "mintemp": [26.2, 22.9, 24.8, 11.2, 10],
                           "maxtemp": [28.2, 23.9, 25.8, 22.1, 12]})
       # Create date column (type datetime64[ns])
       df2["date"] = pd.to_datetime((df2[["year", "month", "day"]]))
       # Add the 400 columns needed (I am only adding 2 as an example)
       # If you change 2 to 100 you will get your 100
       colnumber = 2
       # Maxtemp
       for i in range(1, colnumber + 1):
           col_name = "max_t" + str(i)
           df2[col_name] = df2.set_index("date").rolling(i).max()["maxtemp"].values
       # Mintemp
       for i in range(1, colnumber + 1):
           col_name = "min_t" + str(i)
           df2[col_name] = df2.set_index("date").rolling(i).min()["mintemp"].values
        # TODO: Add rainfall and wind (a sketch follows after the output below)

In[2]: df2
Out[2]: 
   year  month  day  mintemp  maxtemp       date  max_t1  max_t2  min_t1  min_t2
0  2013  1      1    26.2     28.2    2013-01-01  28.2   NaN      26.2   NaN    
1  2013  1      2    22.9     23.9    2013-01-02  23.9    28.2    22.9    22.9  
2  2013  1      3    24.8     25.8    2013-01-03  25.8    25.8    24.8    22.9  
3  2013  1      4    11.2     22.1    2013-01-04  22.1    25.8    11.2    11.2  
4  2013  1      5    10.0     12.0    2013-01-05  12.0    22.1    10.0    10.0  
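
The two TODO loops for rainfall and wind follow the same pattern. A sketch (assuming df2 also carries the rainfall and wind columns from your question; the rolling sum for rainfall and rolling mean for wind are only assumptions, pick whatever aggregation you actually need):

# Rainfall (rolling sum over the window -- an assumption, adjust as needed)
for i in range(1, colnumber + 1):
    col_name = "rf" + str(i)
    df2[col_name] = df2.set_index("date").rolling(i).sum()["rainfall"].values
# Wind (rolling mean over the window -- an assumption, adjust as needed)
for i in range(1, colnumber + 1):
    col_name = "w" + str(i)
    df2[col_name] = df2.set_index("date").rolling(i).mean()["wind"].values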

PROBLEM 2: Merge the two dataframes horizontally using the date column as the common key

You will have to convert the columns to datetime first (similar answers here), and then merge the DataFrames using the common key.

In[3]: df1 = pd.DataFrame({"x0": [40, 35, 33, 38],
                           "x1": [31.05, 35.75, 22, 28],
                           "x1000": [33, 29, 20, 18],
                           "Date": ["2013-1-1", "2013-1-2", "2013-1-3", "2013-1-4"]})
       # Creating common key with type datetime64[ns]
       df1["date"] = pd.to_datetime(df1["Date"])

Out[3]:
   x0     x1  x1000      Date       date
0  40  31.05  33     2013-1-1 2013-01-01
1  35  35.75  29     2013-1-2 2013-01-02
2  33  22.00  20     2013-1-3 2013-01-03
3  38  28.00  18     2013-1-4 2013-01-04

In[4]: # Merging
       df1.merge(df2, how="left", left_on=["date"], right_on=["date"])

Out[4]:
   x0     x1  x1000      Date       date  year  month  day  mintemp  maxtemp  max_t1  max_t2  min_t1  min_t2
0  40  31.05  33     2013-1-1 2013-01-01  2013  1      1    26.2     28.2     28.2   NaN      26.2   NaN    
1  35  35.75  29     2013-1-2 2013-01-02  2013  1      2    22.9     23.9     23.9    28.2    22.9    22.9  
2  33  22.00  20     2013-1-3 2013-01-03  2013  1      3    24.8     25.8     25.8    25.8    24.8    22.9  
3  38  28.00  18     2013-1-4 2013-01-04  2013  1      4    11.2     22.1     22.1    25.8    11.2    11.2 

EDIT: Added output
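
If you don't want to keep the raw df2 columns next to the new ones, you can drop them after the merge (a sketch, using the column names from the example output above):

merged = df1.merge(df2, how="left", on="date")
merged = merged.drop(columns=["year", "month", "day", "mintemp", "maxtemp"])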

  • I haven't tested this when the date column is not ordered, so you might have to sort it when applying to your example – d_d Oct 30 '19 at 06:10
  • Thank you for the detailed answer. It was 100 days **before** rather than after `Date`, sorry for the confusion – nilsinelabore Oct 30 '19 at 22:32
  • Shouldn't the `NaN` for `max_t2` be `23.9`? – nilsinelabore Oct 30 '19 at 23:19
  • This is looking at **before**. The reason why there are NaN is that there is no data for the day before – d_d Oct 31 '19 at 01:15
  • I notice some adjacent days have the same values, eg. `min_t1` and `min_t2` for `2013-01-04` are both 11.2. Is there something wrong? – nilsinelabore Oct 31 '19 at 02:14
  • @nilsinelabore - If you take 2013-01-04, the min temp in the same day was 11.2, and the min temp in the previous 2 days (2013-01-04 and 2013-01-03) was still 11.2. min_t1 and min_t2 should always be the same when the min temp on the previous day was higher. Isn't this expected behavior? – d_d Oct 31 '19 at 05:11
  • I think I missed out some explanation here. The `min_tx` simply means the lowest temperature of day x, rather than a comparative term with other days. Similarly max temperature is the highest temperature in a day. They are all independent of each other. – nilsinelabore Oct 31 '19 at 05:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/201639/discussion-between-d-d-and-nilsinelabore). – d_d Oct 31 '19 at 06:54

I assume that the Date column in df1 is of datetime type. If not, convert it.

Start with the following preparatory steps:

  1. In df2, convert the year / month / day columns into the index (of datetime type):

    df2 = df2.set_index(pd.to_datetime(df2.year * 10000 + df2.month * 100
        + df2.day, format='%Y%m%d')).drop(columns=['year', 'month', 'day'])
    
  2. Set the number of days for which to add columns:

    nDays = 3
    

    For demonstration purposes I set it to only 3, but you may change it to 100 or whatever value you wish.

  3. Define column names for new columns (first import itertools):

    cols = [ x + str(y) for x, y in itertools.product(
        ['max_t', 'min_t', 'rf', 'w'], range(1, nDays + 1)) ]
    
  4. Define a function to generate the additional columns for the current row:

    def fn(row):
        d1 = row.Date
        d2 = d1 + pd.Timedelta(nDays - 1, 'D')
        return pd.Series(df2.loc[d1:d2].values.reshape((1, -1),
            order='F').squeeze(), index=cols)
    

And now, the whole processing can be made in a single instruction, applying the above function to each row and joining the result to the original DataFrame:

df1 = df1.join(df1.apply(fn, axis=1))

Quite concise and, to a large extent, a pandasonic solution.

To demonstrate how this solution works, I changed your data a little:

df1:

   x0     x1    x2    x3       Date
0  40  31.05  25.5  25.5 2013-01-03
1  35  35.75  36.5  36.5 2013-01-07

df2 (initial content):

   year  month  day  maxtemp  mintemp  rainfall  wind
0  2013      1    1     26.2     20.2         0  32.4
1  2013      1    2     22.9     20.3         0  10.0
2  2013      1    3     24.8     18.4         1  28.8
3  2013      1    4     26.6     18.3         2  33.5
4  2013      1    5     28.3     20.9         3  33.4
5  2013      1    6     28.0     21.6         4  32.8
6  2013      1    7     27.5     21.4         5  26.8
7  2013      1    8     42.3     20.9         6  25.5
8  2013      1    9     25.0     21.1         7  20.9
9  2013      1   10     25.4     20.2         8  14.0

df2 (after conversion):

            maxtemp  mintemp  rainfall  wind
2013-01-01     26.2     20.2         0  32.4
2013-01-02     22.9     20.3         0  10.0
2013-01-03     24.8     18.4         1  28.8
2013-01-04     26.6     18.3         2  33.5
2013-01-05     28.3     20.9         3  33.4
2013-01-06     28.0     21.6         4  32.8
2013-01-07     27.5     21.4         5  26.8
2013-01-08     42.3     20.9         6  25.5
2013-01-09     25.0     21.1         7  20.9
2013-01-10     25.4     20.2         8  14.0

After adding new columns, df1 contains:

   x0     x1    x2    x3       Date  max_t1  max_t2  max_t3  min_t1  min_t2  \
0  40  31.05  25.5  25.5 2013-01-03    24.8    26.6    28.3    18.4    18.3   
1  35  35.75  36.5  36.5 2013-01-07    27.5    42.3    25.0    21.4    20.9   

   min_t3  rf1  rf2  rf3    w1    w2    w3  
0    20.9  1.0  2.0  3.0  28.8  33.5  33.4  
1    21.1  5.0  6.0  7.0  26.8  25.5  20.9  
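
For completeness, the whole demonstration above can be reproduced with this self-contained sketch (nDays = 3, sample data as shown):

import itertools
import pandas as pd

df1 = pd.DataFrame({"x0": [40, 35], "x1": [31.05, 35.75],
                    "x2": [25.5, 36.5], "x3": [25.5, 36.5],
                    "Date": pd.to_datetime(["2013-01-03", "2013-01-07"])})
df2 = pd.DataFrame({"year": 2013, "month": 1, "day": range(1, 11),
                    "maxtemp": [26.2, 22.9, 24.8, 26.6, 28.3, 28.0, 27.5, 42.3, 25.0, 25.4],
                    "mintemp": [20.2, 20.3, 18.4, 18.3, 20.9, 21.6, 21.4, 20.9, 21.1, 20.2],
                    "rainfall": range(0, 10),
                    "wind": [32.4, 10.0, 28.8, 33.5, 33.4, 32.8, 26.8, 25.5, 20.9, 14.0]})

# Step 1: move year/month/day into a datetime index
df2 = df2.set_index(pd.to_datetime(df2.year * 10000 + df2.month * 100 + df2.day,
                                   format='%Y%m%d')).drop(columns=['year', 'month', 'day'])

# Steps 2 and 3: window length and the new column names
nDays = 3
cols = [x + str(y) for x, y in itertools.product(['max_t', 'min_t', 'rf', 'w'],
                                                 range(1, nDays + 1))]

# Step 4: slice nDays rows of df2 starting at Date and flatten them column-wise
def fn(row):
    d1 = row.Date
    d2 = d1 + pd.Timedelta(nDays - 1, 'D')
    return pd.Series(df2.loc[d1:d2].values.reshape((1, -1), order='F').squeeze(),
                     index=cols)

df1 = df1.join(df1.apply(fn, axis=1))
print(df1)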

Edit following the "100 days before" comment

If the added values are to be taken from the 100 days before the current date, change the way both "border dates" are set in the fn function. Something like:

def fn(row):
    d1 = row.Date - pd.Timedelta(nDays, 'D')
    d2 = row.Date - pd.Timedelta(1, 'D')
    return pd.Series(df2.loc[d1:d2].values.reshape((1, -1), order='F')
        .squeeze(), index=cols)

How to avoid increase in the number of rows

If your df2 contains multiple rows for some dates, then the join of df1 with df2 results in an increased number of output rows.

If df2 has, e.g., 3 rows for some date, then one row from df1 with this date will yield 3 rows (with the same date) in the result.

To avoid this, you have to "suppress" these repetitions.

Initially I thought about df2 = df2.drop_duplicates(...), but you wrote that one row can contain one set of values and other rows different sets, so we cannot arbitrarily keep one row and delete the others (from the same date).

One of possible solutions to this problem is that after the "date index" is created, you should:

  • group df2 by the index (each group will contain all rows for a particular date),
  • compute the mean of each column (this omits possible NaN values),
  • save the result back under df2.

The code to do it is:

df2 = df2.groupby(level=0).mean()

Then you can join (as described above), and the number of output rows should not grow.
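
As a tiny illustration (a hypothetical duplicated date, pandas imported as pd as elsewhere in the answer), the aggregation collapses repeated index entries into one row per date:

dup = pd.DataFrame({"maxtemp": [24.8, 25.2], "mintemp": [18.4, 19.0],
                    "rainfall": [1, 3], "wind": [28.8, 30.0]},
                   index=pd.to_datetime(["2013-01-03", "2013-01-03"]))
dup.groupby(level=0).mean()
# 2013-01-03: maxtemp 25.0, mintemp 18.7, rainfall 2.0, wind 29.4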

  • I just realised it should be 100 days **before** instead of after the `Date`. Can you take that into account? Sorry for the confusion – nilsinelabore Oct 30 '19 at 22:26
  • I got this `TypeError: ("unsupported operand type(s) for -: 'str' and 'Timedelta'", 'occurred at index 0')` after replacing the old `fn` with the new `fn`. Am I doing it wrong? – nilsinelabore Oct 31 '19 at 09:58
  • Check whether *Date* column in *df1* is of *datetime* type. I wrote about it at the very begin. – Valdi_Bo Oct 31 '19 at 11:14
  • It works now! But as I increase `nDays` to 100, it returned `ValueError: columns overlap but no suffix specified: Index(['maxtemp1', 'maxtemp2', 'maxtemp3', 'mintemp1', 'mintemp2', 'mintemp3', 'rainfall1', 'rainfall2', 'rainfall3', 'wind1', 'wind2', 'wind3'], dtype='object')` – nilsinelabore Oct 31 '19 at 12:14
  • The instruction setting *cols* must be executed **after** you set *nDays* value. From you message (numbers from 1 to 3) I see that you executed this instruction with *nDays == 3*. – Valdi_Bo Oct 31 '19 at 12:22
  • It's what I did, and I had set `nDays` to 100. Could it be something else? I tested with `nDays = 3` first before increasing to 100 which gave this error. Output looks fine with `nDays = 3` – nilsinelabore Oct 31 '19 at 12:28
  • Maybe *df2* does not contain data for just 100 previous days? I wrote my program assuming that it contains. – Valdi_Bo Oct 31 '19 at 12:50
  • `df2` contains 4 years of climate data which should match with most `Date` in df1. But I think certain missing data may happen at border dates. Can we ignore it? Btw it won't work for `nDays = 10` either – nilsinelabore Oct 31 '19 at 12:54
  • The missing data are probably the source of error. Your sample data were from continuous dates, so I didn't expect any gaps. – Valdi_Bo Oct 31 '19 at 14:04
  • You mean gaps **in** the dataset? There shouldn’t be, only chance that there’s missing data is that boundary data falls short of the required date extension. Is there a way of coercion? Maybe drop the rows with incomplete data? – nilsinelabore Oct 31 '19 at 19:49
  • It's working now. I guess it has something to do with the datetime format conversion after I added in additional features. Thank you – nilsinelabore Nov 01 '19 at 01:01
  • But I had another issue: the 921 row df became 3431 rows – nilsinelabore Nov 01 '19 at 02:43
  • I suppose you have some repetitions is *df2*. Find them (using *duplicated* with *subset* for *year*, *month* and *day*) and eliminate. – Valdi_Bo Nov 01 '19 at 09:06
  • Okay, but i notice they are not completely the same. Some rows have the same `Date` and `Id` (so they are unique) but climate data added are different... – nilsinelabore Nov 01 '19 at 09:53