2

I have two large dataframes, but am only showing a small subset of them for convenience. One is in the following form (Table 1):

| Country |   Date   | flag | M   | notes     |  V |
|--------------------------------------------------|
| UK      | 20210319 |  1   | 3.0 | No Change | C1 |
| UK      | 20210320 |  0   | 2.0 | Extension | C2 |
| USA     | 20210405 |  0   | 4.0 | New Policy| C1 |
| CAN     | 20210405 |  0   | 1.0 | Update    | C3 |

and the other is in the form (Table 2):

| Country |   Date   |  C1 | C1_flag | C1_notes | 
|-----------------------------------------------|
| UK      | 20210319 | NaN |   NaN   | NaN      | 
| USA     | 20210405 | NaN |   NaN   | NaN      |
| AUS     | 20210505 | NaN |   NaN   | NaN      |
| NZ      | 20210506 | NaN |   NaN   | NaN      |

where after the 'C1_notes' column there are additional columns (C2, C2_flag, and C2_notes). In fact this column structure is present for the following codes in the 'V' column of the first table (C1, C2, C3, C4, C5, E1, E2, H1, H2, H3), but for brevity I am only showing C1 in Table 2.

I would like Table 2 to be populated using the information from the Table 1, such that it ultimately ends up having the form:

| Country |   Date   |  C1 | C1_flag | C1_notes  |
|------------------------------------------------|
| UK      | 20210319 | 3.0 |   1     | No Change | 
| USA     | 20210405 | 4.0 |   0     | New Policy|
| AUS     | 20210505 | NaN |   NaN   | NaN       |
| NZ      | 20210506 | NaN |   NaN   | NaN       |

where the 'C1' columns in Table 2 use the 'M' column of Table 1 as their values, and similarly 'C1_flag' column uses the 'flag' column. The populating of Table 2 is on the basis that the 'Country' and 'Date' columns in Table 2 match with their counterparts in Table 1.

I was hoping that a 'merge' type operation could deal with this, but the catch is that the values (C1...H3) are column values in Table 1, but column names in Table 2. I also thought that this was a Pivot/Stack/Melt type operation in Pandas but it turns out not to be so, and a more general mapping may be required. How would I perform this correspondence (both for C1, and also generalizing for the other C, E, and H codes)? Thanks.

Prakash_S
  • 77
  • 5
  • Does this answer your question? [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – Danail Petrov May 26 '21 at 15:42
  • A merge type operation is unlikely to help, because as mentioned the values (C1...H3) are column Values in Table 1, but column Names in Table 2 – Prakash_S May 26 '21 at 15:58

1 Answers1

1

Restructure the 1st dataframe / manipulate columns and then update the other dataframe with this restructured dataframe.

k = df1.pivot(index=['Country','Date'] , columns= ['V'] , values= ['flag','M','notes'])
k.columns = ['_'.join(col[::-1]) if 'M' not in col else col[-1] for col in k.columns]
k = k[sorted(k.columns)]
df2 = df2.set_index(['Country', 'Date'])
df2.update(k)

OUTPUT:

                   C1 C1_flag   C1_notes
Country Date                            
UK      20210319  3.0       1   NoChange
USA     20210405  4.0       0  NewPolicy
AUS     20210505  NaN     NaN        NaN
NZ      20210506  NaN     NaN        NaN

Complete code:

from numpy import nan

d1 = {'Country': {0: 'UK', 1: 'UK', 2: 'USA', 3: 'CAN'}, 'Date': {0: 20210319, 1: 20210320, 2: 20210405, 3: 20210405}, 'flag': {0: 1, 1: 0, 2: 0, 3: 0},
      'M': {0: 3.0, 1: 2.0, 2: 4.0, 3: 1.0}, 'notes': {0: 'NoChange', 1: 'Extension', 2: 'NewPolicy', 3: 'Update'}, 'V': {0: 'C1', 1: 'C2', 2: 'C1', 3: 'C3'}}
d2 = {'Country': {0: 'UK', 1: 'USA', 2: 'AUS', 3: 'NZ'}, 'Date': {0: 20210319, 1: 20210405, 2: 20210505, 3: 20210506}, 'C1': {
    0: nan, 1: nan, 2: nan, 3: nan}, 'C1_flag': {0: nan, 1: nan, 2: nan, 3: nan}, 'C1_notes': {0: nan, 1: nan, 2: nan, 3: nan}}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
k = df1.pivot(index=['Country', 'Date'], columns=[
              'V'], values=['flag', 'M', 'notes'])
k.columns = ['_'.join(col[::-1]) if 'M' not in col else col[-1]
             for col in k.columns]
k = k[sorted(k.columns)]
df2 = df2.set_index(['Country', 'Date'])
df2.update(k)

Nk03
  • 14,699
  • 2
  • 8
  • 22
  • @Prakash_S I've added the complete code. Can you try again? – Nk03 May 26 '21 at 16:24
  • Using 'pivot' in the first line gives a Value Error, but pivot_table works instead. But for some reason when I run the last line df2.update(k), I get a different Value Error: Buffer dtype mismatch, expected 'Python object' but got 'long long' – Prakash_S May 26 '21 at 16:34
  • The whole code now exectutes, but 'C_1' notes column is not updating. – Prakash_S May 26 '21 at 16:37
  • Solved the last problem by adding in the following as an extra argument of the pivot operation acting on df1: aggfunc=lambda x: ''.join(str(v) for v in x) – Prakash_S May 26 '21 at 17:11