
Consider this data:

>>> import pandas as pd
>>>
dfA = pd.DataFrame({
  "key":[1,3,6,10,15,21],
  "columnA":[10,20,30,40,50,60],
  "columnB":[100,200,300,400,500,600],
  "columnC":[110,202,330,404,550,606],
})
dfB = pd.DataFrame({
  "key":[1,2,5,9,14,20],
  "columnA":[40,50,60,70,80,90],
  "columnB":[400,500,600,700,800,900],
  "columnC":[440,505,660,707,880,909],
})
>>> dfA
   key  columnA  columnB  columnC
0    1       10      100      110
1    3       20      200      202
2    6       30      300      330
3   10       40      400      404
4   15       50      500      550
5   21       60      600      606
>>> dfB
   key  columnA  columnB  columnC
0    1       40      400      440
1    2       50      500      505
2    5       60      600      660
3    9       70      700      707
4   14       80      800      880
5   20       90      900      909

I basically want to extract the range of rows where the "key" column is between 2 and 16 from both tables, and show this "subrange" for only the "key", "columnA" and "columnC" columns - side by side, to make a visual comparison.

Via Extract rows range with .between(), and specific columns, from Pandas DataFrame? and Merge Pandas dataframes on column, and have result sorted by same column, I ended up with the following (sort-of) one-liner:

>>>
pd.merge(
  dfA[dfA['key'].between(2,16)][['key','columnA','columnC']], 
  dfB[dfB['key'].between(2,16)][['key','columnA','columnC']], 
  on='key', how='outer'
).astype('Int64').sort_values('key', ignore_index=True)

   key  columnA_x  columnC_x  columnA_y  columnC_y
0    2       <NA>       <NA>         50        505
1    3         20        202       <NA>       <NA>
2    5       <NA>       <NA>         60        660
3    6         30        330       <NA>       <NA>
4    9       <NA>       <NA>         70        707
5   10         40        404       <NA>       <NA>
6   14       <NA>       <NA>         80        880
7   15         50        550       <NA>       <NA>

Essentially, this is the data I wanted to obtain - except, I'd like to know this:

  • Is there a way to do the above, without having to type the row range (here, .between(2,16)) twice?
  • How can I format these results with additional "spanning" or "merged" headers (MultiIndex, as in Pandas to Excel (Merged Header Column))? More precisely, the above shows column headings:
(index)   key  columnA_x  columnC_x  columnA_y  columnC_y

... and what I'd want instead is for the original column name to become a "category" heading, and for the actual columns to become "A" or "B" below that category - for all but the (merged) "key" column... so something like:

               columnA        columnC
(index)   key        A   B          A   B

(So this involves remapping (renaming) column names to a certain extent as well.)

How can I do this? And can it be done in a one-liner, as in the above pd.merge example?
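In other words, something like the following would avoid repeating the range - though the `trim` helper here is just my own hypothetical name, and I'd prefer a pandas-idiomatic one-liner:

```python
import pandas as pd

dfA = pd.DataFrame({
    "key": [1, 3, 6, 10, 15, 21],
    "columnA": [10, 20, 30, 40, 50, 60],
    "columnC": [110, 202, 330, 404, 550, 606],
})
dfB = pd.DataFrame({
    "key": [1, 2, 5, 9, 14, 20],
    "columnA": [40, 50, 60, 70, 80, 90],
    "columnC": [440, 505, 660, 707, 880, 909],
})

# hypothetical helper: restrict to the key range and the columns of interest,
# so the range and column list are written only once
def trim(df, lo=2, hi=16, cols=("key", "columnA", "columnC")):
    return df[df["key"].between(lo, hi)][list(cols)]

merged = pd.merge(trim(dfA), trim(dfB), on="key", how="outer"
                  ).astype("Int64").sort_values("key", ignore_index=True)
print(merged)
```

This produces the same flat `columnA_x`/`columnA_y` headings as the merge above, so it solves only the repetition, not the "spanning" headers.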

sdaau

2 Answers


You can do this by setting `key` as the index (so `.loc[2:16]` slices the row range by label) and using `concat`:

(pd.concat([d.set_index('key').loc[2:16] for d in (dfA,dfB)],
          axis=1, keys=['A','B'])
   .swaplevel(0,1, axis=1)
   .sort_index(axis=1)
)

Output:

    columnA       columnB        columnC       
          A     B       A      B       A      B
key                                            
2       NaN  50.0     NaN  500.0     NaN  505.0
3      20.0   NaN   200.0    NaN   202.0    NaN
5       NaN  60.0     NaN  600.0     NaN  660.0
6      30.0   NaN   300.0    NaN   330.0    NaN
9       NaN  70.0     NaN  700.0     NaN  707.0
10     40.0   NaN   400.0    NaN   404.0    NaN
14      NaN  80.0     NaN  800.0     NaN  880.0
15     50.0   NaN   500.0    NaN   550.0    NaN
Quang Hoang
  • Thanks @QuangHoang - that looks pretty neat (and I just learned that I can wrap statements like that in `(...)` for the interpreter without getting syntax error complaints). However, is there a way in your solution to show only "columnA" and "ColumnC" (and "key")? Note that in my actual case there may be 30 columns or so, and it would be a bit tedious to "drop" them; I'd rather specify explicitly which ones are shown. – sdaau Apr 10 '20 at 15:34
  • Do you mean you want to keep only two columns `ColumnA` and `ColumnC`? Or do you have a (longer) list of columns you want to keep? – Quang Hoang Apr 10 '20 at 15:41
  • Thanks - I mean I only want to keep `ColumnA` and `ColumnC` (edit: and `key`) – sdaau Apr 10 '20 at 15:44
  • Then `loc[2:16, ['ColumnA','ColumnC']]` instead of just `loc[2:16]`? – Quang Hoang Apr 10 '20 at 15:46

Do note this - as per @QuangHoang's comment:

>>>
(pd.concat([d.set_index('key').loc[2:16, ['ColumnA','ColumnC']] for d in (dfA,dfB)],
          axis=1, keys=['A','B'])
   .swaplevel(0,1, axis=1)
   .sort_index(axis=1)
)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1289, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1954, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1595, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
    self._validate_read_indexer(
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1640, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['ColumnA', 'ColumnC'], dtype='object')] are in the [columns]"

... the KeyError above is simply due to capitalization - the actual columns are named `columnA`/`columnC`, not `ColumnA`/`ColumnC` (as the error message itself indicates). In any case, this also works:

(pd.concat([ d[d['key'].between(2,16)][['key','columnA','columnC']].set_index('key') for d in (dfA,dfB) ],
          axis=1, keys=['A','B'])
   .swaplevel(0,1, axis=1)
   .sort_index(axis=1)
)

... and the output is:

    columnA       columnC
          A     B       A      B
key
2       NaN  50.0     NaN  505.0
3      20.0   NaN   202.0    NaN
5       NaN  60.0     NaN  660.0
6      30.0   NaN   330.0    NaN
9       NaN  70.0     NaN  707.0
10     40.0   NaN   404.0    NaN
14      NaN  80.0     NaN  880.0
15     50.0   NaN   550.0    NaN
sdaau