1

I hope someone might help me.

I have a dataframe that inculdes columns with similar names (see example data)

I have 3 additional lists of column names which include the original names of the columns (i.e. the string occurring before the question mark (see lists of column names)

I need to subset the df dataframe into 3 separate dataframes, based on matching the first part of the column names present in the 3 lists. The expected output at the bottom.

It has to be in lists (or something programmatic) as I have lots and lots of columns like this. I tried pattern matching but because some names are very similar, they match to multiple lists.

thank you in advance!

example data

df = {'id': ['1','2','3','4'],
        'ab? op':  ['green', 'red', 'blue', 'None'],
        'ab? 1': ['red', 'yellow', 'None', 'None'],
        'cd': ['L', 'XL', 'M','L'],
        'efab? cba' : ['husband', 'wife', 'husband', 'None'],
        'efab? 1':['son', 'grandparent', 'son', 'None'],
        'efab? 2':['None', 'son', 'None', 'None'],
        'fab? 4':['9', '10', '5', '3'], 
        'fab? po':['England', 'Scotland', 'Wales', 'NA'] }


df = pd.DataFrame(df, columns = ['id','ab? op', 'ab? 1', 'cd', 'efab? cba', 'efab? 1', 'efab? 2', 'fab? 4', 'fab? po'])

list of column names in other 3 data frames


df1_lst = ['ab', 'cd']
df2_lst = ['efab']
df3_lst = ['fab']

desired output

df1 = ['ab? op', 'ab? 1', 'cd']
df2 = ['efab? cba', 'efab? 1', 'efab? 2']
df3 = ['fab? 4', 'fab? po']
b101
  • 287
  • 1
  • 8

2 Answers2

1

You can form a dynamic regex for each df lists:

df_lists = [df1_lst, df2_lst, df3_lst]

result = [df.filter(regex=fr"\b({'|'.join(names)})\??") for names in df_lists]

e.g., for the first list, the regex is \b(ab|cd)\?? i.e. look for either ab or cd but they should be standalone from the left side (\b) and there might be an optional ? afterwards.

The desired entries are in the result list e.g.

>>> result[1]

  efab? cba      efab? 1 efab? 2
0   husband          son    None
1      wife  grandparent     son
2   husband          son    None
3      None         None    None
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
  • ayadin thanks for your comment, coming back to this can I ask what the 'fr' is before "\b ? I cannot figure it out! thanks – b101 Jun 16 '21 at 09:28
  • hi @b101. `f` makes it an [f-string](https://realpython.com/python-f-strings/) i.e., formatted string so that we put `{}` and the Python code inside are evaluated. `r` makes it a raw string so that we can use backslashes without needing to escape them i.e., without `r`, we needed to write `\\b` and `\\?` in there. Hope it helps. – Mustafa Aydın Jun 16 '21 at 09:31
0

Split column names by ?, keep the first part and check if they are in list:

df1 = df.loc[:, df.columns.str.split('?').str[0].isin(df1_lst)]
df2 = df.loc[:, df.columns.str.split('?').str[0].isin(df2_lst)]
df3 = df.loc[:, df.columns.str.split('?').str[0].isin(df3_lst)]
>>> df1
  ab? op   ab? 1  cd
0  green     red   L
1    red  yellow  XL
2   blue    None   M
3   None    None   L

>>> df2
  efab? cba      efab? 1 efab? 2
0   husband          son    None
1      wife  grandparent     son
2   husband          son    None
3      None         None    None

>>> df3
  fab? 4   fab? po
0      9   England
1     10  Scotland
2      5     Wales
3      3        NA
Corralien
  • 109,409
  • 8
  • 28
  • 52