I am trying to tidy up a CSV I was given where the column names are not very developer-friendly. I would like to use regular expressions to match multiple patterns in the column names and apply a different replacement for each. For example, df1 has leading/trailing spaces, whitespace throughout the headers, parentheses (), and <. I would like to remove the leading/trailing spaces and the parentheses, replace the interior whitespace with _, and replace < with LESS_THAN.
For example, turning df1 into df2:
df1 = pd.DataFrame({' APPLES AND LEMONS': [1, 2], ' ORANGES (POUNDS) ': [2, 1], ' BANANAS < 5 ': [8, 9]})

    APPLES AND LEMONS    ORANGES (POUNDS)     BANANAS < 5 
0                   1                   2               8
1                   2                   1               9
df2 = pd.DataFrame({'APPLES_AND_LEMONS': [1, 2], 'ORANGES_POUNDS': [2, 1], 'BANANAS_LESS_THAN_5': [8, 9]})

   APPLES_AND_LEMONS  ORANGES_POUNDS  BANANAS_LESS_THAN_5
0                  1               2                    8
1                  2               1                    9
My current implementation just chains a bunch of str.replace calls. Is there a better way to do this? Regular expressions seem especially useful here because there are hundreds of columns and I'm sure there will be a few more headaches that I have yet to find.
df1.columns = df1.columns.str.strip()
df1.columns = df1.columns.str.replace(' ', '_').str.replace('<', 'LESS_THAN').str.replace('(', '').str.replace(')', '')
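One possible answer to the question above: instead of chaining str.replace calls, drive the cleanup from a single mapping of regex pattern to replacement. This is only a sketch under the assumptions stated in the question (spaces, parentheses, and < are the known problems); the `replacements` dict and `clean` helper are names introduced here for illustration, and new headaches only need a new dict entry rather than another chained call.

```python
import re

# One table of regex -> replacement; applied in insertion order, so '<'
# becomes 'LESS_THAN' before runs of whitespace collapse to underscores.
replacements = {
    r'[()]': '',        # drop parentheses
    r'<': 'LESS_THAN',  # spell out the comparison sign
    r'\s+': '_',        # collapse any whitespace run to one underscore
}

def clean(name):
    """Strip leading/trailing spaces, then apply each regex in order."""
    name = name.strip()
    for pattern, repl in replacements.items():
        name = re.sub(pattern, repl, name)
    return name
```

Applied to the frame in the question, either `df1.columns = [clean(c) for c in df1.columns]` or `df1 = df1.rename(columns=clean)` produces the df2 headers; pandas `rename` accepts a callable for `columns`.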