0

The purpose of this post is discussion primarily, so even loose ideas or strings to pull would be appreciated. I'm trying to bin some data for analysis, and was wondering what is the cleanest way to bin my data using Pandas.cut. For some context, I'm specifically trying to bin ICD-9 diagnostic data into categories and am using this list as a starting point. From what I'm reading, a common way to do this is something like this:

break_points = [0, 139, 239, ...]
labels = ['infectious and parasitic diseases', 'neoplasms', 'endocrine diseases', ...]
df['diag_codes_binned'] = pd.cut(df['diag_codes'],
                                 bins=break_points,
                                 labels=labels)

I recognize that this is a perfectly functional way to do this, but I don't like how hard it is to visually inspect the code and determine what range lines up with what label. I am exploring using a dictionary for this like this:

diagnosis_code_dict = {139: 'infectious and parasitic diseases',
                       239: 'neoplasms',
                       279: 'endocrine diseases',
                       ...}

But the pd.cut function doesn't seem to get along well with my dictionary. There appears to be one way to do this using a dataframe as a lookup table with min and max values, shown here, and that seems to be one possibility (example below):

In [187]: lkp
Out[187]:
   Min  Max  Val
0    1   99  AAA
1  100  199  BBB
2  200  299  CCC
3  300  399  DDD

Lastly, I have one more consideration for the data set that I'm working through the best way to handle. Some of the diagnostic codes start with V or E, and currently I'm planning on pre-processing these to convert them into an extension of the range and handling them that way. For example, if the range of possible non-E/V codes is range(0,1000), then I could convert E's into a range(1000, 2000) and V's into a range(2000, 3000) so that I could maintain a single lookup table or dictionary for all codes from which I could cut into however many bins I wanted. That said, this method results in some loss of the ability to at-a-glance understand these codes, so I'd be open to suggestions if there is a better way to handle this.

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
Brendan
  • 1,905
  • 2
  • 19
  • 25
  • 2
    Can you provide a small reproducible sample data set and your desired data set? Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. – MaxU - stand with Ukraine Oct 31 '17 at 17:09
  • 1
    I didn't answer the second part of your question (regarding E and V codes) partly because there are SO many different ways you could do this. The way you've hit upon would most likely be the most concise: you could string replace the E with another integer (like 3, so E001 becomes 3001) and then add all the categories to your dictionary and be done. As long as you comment this section well I think it's a sensible, easy solution. – HFBrowning Oct 31 '17 at 18:11
  • 1
    @HFBrowning, because the codes weren't consistently formatted as a char followed by 3 numbers, I decided to just go with a simple lambda that converted any strings starting with E/V to 1000/2000. Since I wasn't planning on breaking out subcategories within E/V, it was an acceptable solution. I will note that it took me longer than I'd like to admit to figure out why my pd.cut function wasn't working on the data - even though everything was now numerical, the df was still formatted as strings. A quick conversion to numerical made it work like a charm. – Brendan Nov 01 '17 at 03:07

1 Answers1

2

I would simply write a small helper function. Here's one idea:

import pandas as pd

def bin_helper(code_dict):
    break_points = [0] + sorted(code_dict) #0 added for lower bound on binning
    labels = [code_dict[value] for value in sorted(code_dict)]
    return break_points, labels

# Setting up some minimal reproducible code...
data = {'diag_codes': range(1, 300),
        'diag_codes_binned': ''}
df = pd.DataFrame.from_dict(data)
diag_code_dict = {139: 'infectious and parasitic diseases',
                  239: 'neoplasms',
                  279: 'endocrine diseases'}

# Run the function and drop it into pandas.cut
bins, labels = bin_helper(diag_code_dict)
df['diag_codes_binned'] = pd.cut(df['diag_codes'],
                                 bins=bins,
                                 labels=labels)

I agree that dictionaries (besides being an incredibly fast, versatile data structure in their own right!) are a very nice way to provide some context in your code about what data are supposed to mean. I often use a small "black-box" function to do the actual work if I need a dictionary to serve as part of my documentation.

HFBrowning
  • 2,196
  • 3
  • 23
  • 42