
I have a collection of Excel spreadsheets from CMS (Medicare) that I want to analyze, and I have successfully imported them into pandas dataframes. Unfortunately, the column names are not uniform: many are similar, but vary due to random spaces, newlines, or extra information. For example:

  • 'Vascular or Circulatory Disease'
  • 'Vascular or Circulatory Disease (CC 104-106)'
  • 'Vascular or Circulatory Disease '

OR

  • 'ID\nNumber'
  • 'ID \nNumber'
  • 'ID Number'

I would simply change the names of the columns individually (as in pandas: Merge two columns with different names?), but I have over 350 columns and a high probability that the column names will change in the future.

Some ideas are to use regex to build patterns that match names, but it seems difficult to capture every case, and new cases are likely to appear in the future. Another idea is to use NLP to soft-match columns. The pure whitespace/newline variants, at least, could be handled by normalizing the names first, as in the sketch below.
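
A minimal sketch of that normalization (the helper name is illustrative):

import re

def normalize(name):
    # Collapse any run of whitespace (spaces, newlines) to a single space and trim
    return re.sub(r'\s+', ' ', str(name)).strip()

df.columns = [normalize(c) for c in df.columns]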

Any suggestions or libraries? Thank you!

jdpy19

2 Answers


You can compare the similarity of strings using the built-in difflib library:

from difflib import SequenceMatcher

def get_sim_ratio(x, y):
    # Ratio of matching characters between the two strings, in [0, 1]
    return SequenceMatcher(None, x, y).ratio()

print(get_sim_ratio('Vascular or Circulatory Disease', 'Vascular or Circulatory Disease (CC 104-106)'))
print(get_sim_ratio('Endocrine Disease', 'Vascular or Circulatory Disease (CC 104-106)'))

This outputs:

0.8266666666666667
0.36065573770491804

Using that output, you can set a sensitivity threshold for merging columns (e.g. if the ratio is greater than 0.5, treat the names as the same column).
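
For example, a minimal sketch of that thresholding idea (canonical, the 0.8 cutoff, and df are illustrative):

canonical = ['ID Number', 'Vascular or Circulatory Disease']  # your standard names

def best_match(name, candidates, threshold=0.8):
    # Pick the closest canonical name; keep the original if nothing scores high enough
    score, match = max((get_sim_ratio(name, c), c) for c in candidates)
    return match if score >= threshold else name

df = df.rename(columns=lambda c: best_match(c, canonical))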

Anna Nevison
  • Thank you for the answer! I started playing around with the FuzzyWuzzy library, but will give this a shot. – jdpy19 Aug 12 '19 at 19:18
  • I ended up using the function you provided and adding token_sort_ratio and token_set_ratio with levels of sensitivity to determine which columns matched. – jdpy19 Aug 12 '19 at 20:38
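
A rough sketch of what that last comment describes (assuming the fuzzywuzzy package; how to combine the scores is up to you):

from fuzzywuzzy import fuzz

# token_sort_ratio tokenizes on non-alphanumeric characters, so the
# newline variants of 'ID Number' score a perfect 100
print(fuzz.token_sort_ratio('ID\nNumber', 'ID Number'))  # 100
# token_set_ratio scores 100 when one name's tokens are a subset of the other's
print(fuzz.token_set_ratio('Vascular or Circulatory Disease',
                           'Vascular or Circulatory Disease (CC 104-106)'))  # 100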

If the columns are really the same but just labelled a bit differently, you can manually create a standard list of columns and set all the data frames to use it. That is, column 1 is always some variation on 'ID Number' and column 2 is always some variation on 'Vascular or Circulatory Disease'; only the labels differ.

import pandas as pd

data_frames = []
for file in files:
    df = pd.read_excel(file)
    # Overwrite the messy headers with the standard names, in column order
    df.columns = ['ID Number', 'Vascular or Circulatory Disease']  # and so forth
    data_frames.append(df)

combined = pd.concat(data_frames)

And if the files have a consistent set of columns except that some have more at the end (e.g. a column was added or removed at some point), you can use the longest version as the standard list and pad the shorter files with NaN:

import numpy as np

def set_columns(data, columns):
    if len(data.columns) < len(columns):
        # diff is negative: the number of standard columns this file is missing
        diff = len(data.columns) - len(columns)
        # Rename the columns that are present (columns[:diff] drops the last -diff names)
        data.columns = columns[:diff]
        # Add the missing trailing columns, filled with NaN
        for i in range(diff, 0):
            data[columns[i]] = np.nan
    else:
        data.columns = columns
    return data
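
Combined with the loop above, usage might look like this (a sketch; standard_columns stands in for your full canonical list):

standard_columns = ['ID Number', 'Vascular or Circulatory Disease']  # full list here

data_frames = []
for file in files:
    df = pd.read_excel(file)
    data_frames.append(set_columns(df, standard_columns))

combined = pd.concat(data_frames)
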
lgaud