The purpose of this post is discussion primarily, so even loose ideas or strings to pull would be appreciated. I'm trying to bin some data for analysis, and was wondering what is the cleanest way to bin my data using Pandas.cut
. For some context, I'm specifically trying to bin ICD-9 diagnostic data into categories and am using this list as a starting point. From what I'm reading, a common way to do this is something like this:
break_points = [0, 139, 239, ...]
labels = ['infectious and parasitic diseases', 'neoplasms', 'endocrine diseases', ...]
df['diag_codes_binned'] = pd.cut(df['diag_codes'],
bins=break_points,
labels=labels)
I recognize that this is a perfectly functional way to do this, but I don't like how hard it is to visually inspect the code and determine what range lines up with what label. I am exploring using a dictionary for this like this:
diagnosis_code_dict = {139: 'infectious and parasitic diseases',
239: 'neoplasms',
279: 'endocrine diseases',
...}
But the pd.cut function doesn't seem to get along well with my dictionary. There appears to be one way to do this using a dataframe as a lookup table with min and max values, shown here, and that seems to be one possibility (example below):
In [187]: lkp
Out[187]:
Min Max Val
0 1 99 AAA
1 100 199 BBB
2 200 299 CCC
3 300 399 DDD
Lastly, I have one more consideration for the data set that I'm working through the best way to handle. Some of the diagnostic codes start with V or E, and currently I'm planning on pre-processing these to convert them into an extension of the range and handling them that way. For example, if the range of possible non-E/V codes is range(0,1000)
, then I could convert E's into a range(1000, 2000)
and V's into a range(2000, 3000)
so that I could maintain a single lookup table or dictionary for all codes from which I could cut into however many bins I wanted. That said, this method results in some loss of the ability to at-a-glance understand these codes, so I'd be open to suggestions if there is a better way to handle this.