
Consider this data:

>>> import pandas as pd
>>>
dfA = pd.DataFrame({
  "key":[1,3,6,10,15,21],
  "columnA":[10,20,30,40,50,60],
  "columnB":[100,200,300,400,500,600],
  "columnC":[110,202,330,404,550,606],
})
dfB = pd.DataFrame({
  "key":[1,2,5,9,14,20],
  "columnA":[40,50,60,70,80,90],
  "columnB":[400,500,600,700,800,900],
  "columnC":[440,505,660,707,880,909],
})
>>> dfA
   key  columnA  columnB  columnC
0    1       10      100      110
1    3       20      200      202
2    6       30      300      330
3   10       40      400      404
4   15       50      500      550
5   21       60      600      606
>>> dfB
   key  columnA  columnB  columnC
0    1       40      400      440
1    2       50      500      505
2    5       60      600      660
3    9       70      700      707
4   14       80      800      880
5   20       90      900      909

I basically want to extract the range of rows where the "key" column is between 2 and 16 from both tables, and show this "subrange" for only the "key", "columnA" and "columnC" columns - side by side, to make a visual comparison.

Via Extract rows range with .between(), and specific columns, from Pandas DataFrame? and Merge Pandas dataframes on column, and have result sorted by same column, I ended up with the following (sort-of) one-liner:

>>>
pd.merge(
  dfA[dfA['key'].between(2,16)][['key','columnA','columnC']], 
  dfB[dfB['key'].between(2,16)][['key','columnA','columnC']], 
  on='key', how='outer'
).astype('Int64').sort_values('key', ignore_index=True)

   key  columnA_x  columnC_x  columnA_y  columnC_y
0    2       <NA>       <NA>         50        505
1    3         20        202       <NA>       <NA>
2    5       <NA>       <NA>         60        660
3    6         30        330       <NA>       <NA>
4    9       <NA>       <NA>         70        707
5   10         40        404       <NA>       <NA>
6   14       <NA>       <NA>         80        880
7   15         50        550       <NA>       <NA>

Essentially, this is the data I wanted to obtain - except, I'd like to know this:

  • Is there a way to do the above, without having to type the row range (here, .between(2,16)) twice?
  • How can I format these results with additional "spanning" or "merged" headers (MultiIndex, as in Pandas to Excel (Merged Header Column))? More precisely, the above shows column headings:
(index)   key  columnA_x  columnC_x  columnA_y  columnC_y

... and what I'd want instead is for the original column name to become a "category" heading, and for the actual columns to become "A" or "B" below that category - for all but the (merged) "key" column... so something like:

               columnA        columnC
(index)   key        A   B          A   B

(So this involves remapping (renaming) column names to a certain extent as well.)

How can I do this? And can it be done in a one-liner, as in the above pd.merge example?
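In other words, something like the following would avoid repeating the range - though the `trim` helper here is just my own hypothetical name, and I'd prefer a pandas-idiomatic one-liner:

```python
import pandas as pd

dfA = pd.DataFrame({
    "key": [1, 3, 6, 10, 15, 21],
    "columnA": [10, 20, 30, 40, 50, 60],
    "columnC": [110, 202, 330, 404, 550, 606],
})
dfB = pd.DataFrame({
    "key": [1, 2, 5, 9, 14, 20],
    "columnA": [40, 50, 60, 70, 80, 90],
    "columnC": [440, 505, 660, 707, 880, 909],
})

# hypothetical helper: restrict to the key range and the columns of interest,
# so the range and column list are written only once
def trim(df, lo=2, hi=16, cols=("key", "columnA", "columnC")):
    return df[df["key"].between(lo, hi)][list(cols)]

merged = pd.merge(trim(dfA), trim(dfB), on="key", how="outer"
                  ).astype("Int64").sort_values("key", ignore_index=True)
print(merged)
```

This produces the same flat `columnA_x`/`columnA_y` headings as the merge above, so it solves only the repetition, not the "spanning" headers.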

sdaau

2 Answers


You can do this by setting `key` as the index (so `.loc[2:16]` slices the row range by label) and using `concat`:

(pd.concat([d.set_index('key').loc[2:16] for d in (dfA,dfB)],
          axis=1, keys=['A','B'])
   .swaplevel(0,1, axis=1)
   .sort_index(axis=1)
)

Output:

    columnA       columnB        columnC       
          A     B       A      B       A      B
key                                            
2       NaN  50.0     NaN  500.0     NaN  505.0
3      20.0   NaN   200.0    NaN   202.0    NaN
5       NaN  60.0     NaN  600.0     NaN  660.0
6      30.0   NaN   300.0    NaN   330.0    NaN
9       NaN  70.0     NaN  700.0     NaN  707.0
10     40.0   NaN   400.0    NaN   404.0    NaN
14      NaN  80.0     NaN  800.0     NaN  880.0
15     50.0   NaN   500.0    NaN   550.0    NaN
Quang Hoang
  • Thanks @QuangHoang - that looks pretty neat (and I just learned that I can wrap statements like that in `(...)` for the interpreter without getting syntax error complaints). However, is there a way in your solution to show only "columnA" and "ColumnC" (and "key")? Note that in my actual case there may be 30 columns or so, and it would be a bit tedious to "drop" them; I'd rather specify explicitly which ones are shown. – sdaau Apr 10 '20 at 15:34
  • Do you mean you want to keep only two columns `ColumnA` and `ColumnC`? Or do you have a (longer) list of columns you want to keep? – Quang Hoang Apr 10 '20 at 15:41
  • Thanks - I mean I only want to keep `ColumnA` and `ColumnC` (edit: and `key`) – sdaau Apr 10 '20 at 15:44
  • Then `loc[2:16, ['ColumnA','ColumnC']]` instead of just `loc[2:16]`? – Quang Hoang Apr 10 '20 at 15:46

Do note this - as per @QuangHoang's comment:

>>>
(pd.concat([d.set_index('key').loc[2:16, ['ColumnA','ColumnC']] for d in (dfA,dfB)],
          axis=1, keys=['A','B'])
   .swaplevel(0,1, axis=1)
   .sort_index(axis=1)
)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
    return self._getitem_tuple(key)
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1289, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1954, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1595, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1552, in _get_listlike_indexer
    self._validate_read_indexer(
  File "C:/msys64/mingw64/lib/python3.8/site-packages/pandas/core/indexing.py", line 1640, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['ColumnA', 'ColumnC'], dtype='object')] are in the [columns]"

... the KeyError above is simply due to capitalization - the actual columns are named `columnA`/`columnC`, not `ColumnA`/`ColumnC` (as the error message itself indicates). In any case, this also works:

(pd.concat([ d[d['key'].between(2,16)][['key','columnA','columnC']].set_index('key') for d in (dfA,dfB) ],
          axis=1, keys=['A','B'])
   .swaplevel(0,1, axis=1)
   .sort_index(axis=1)
)

... and the output is:

    columnA       columnC
          A     B       A      B
key
2       NaN  50.0     NaN  505.0
3      20.0   NaN   202.0    NaN
5       NaN  60.0     NaN  660.0
6      30.0   NaN   330.0    NaN
9       NaN  70.0     NaN  707.0
10     40.0   NaN   404.0    NaN
14      NaN  80.0     NaN  880.0
15     50.0   NaN   550.0    NaN
sdaau