I have a collection of Excel spreadsheets from CMS (Medicare) that I want to analyze and have successfully imported them using pandas into a dataframe. Unfortunately, the column names are not uniform and many are similar, but vary due to random spaces, new lines, or extra information. Example:
- 'Vascular or Circulatory Disease'
- 'Vascular or Circulatory Disease (CC 104-106)'
- 'Vascular or Circulatory Disease '
OR
- 'ID\nNumber'
- 'ID \nNumber'
- 'ID Number'
I would simply change the names of the columns individually pandas: Merge two columns with different names?, but I have over 350 columns and high probability that they column names will change in the future.
Some ideas are to use regex to create cases to match names, but I am seeing it difficult to capture all cases and potential to run into new cases in the future. Another idea is to use NLP to soft match columns.
Any suggestions or libraries? Thank you!