DataFrame - Categorizing Data

Question

I'm a Python / Pandas beginner and am currently working on some projects in the IPython notebook. I just ran into a little problem that I couldn't solve with my book or by googling, maybe because I'm not exactly sure what term or function to search for.

Let's say I have a DataFrame with a column:

Industry Category

Software/Industry Systems
Software/Medical Systems
Software/Payment 
Electronic Components
Database Applications
Online Communities
Medical Equipment
Mobile Phones

What I want is to create a new column that assigns each value in "Industry Category" to a "Parent Category". In this example, just "Software" and "Hardware".

Industry Category                    Parent Category

Software/Industry Systems            Software
Software/Medical Systems             Software 
Software/Payment                     Software 
Electronic Components                Hardware
Database Applications                Software
Online Communities                   Software 
Medical Equipment                    Hardware
Mobile Phones                        Hardware

Note: There are about 600 Industry Category items in my list and about 30 Categories I have to sort them into.

So it would be great if there's some option to do the job with a *.csv with two columns: on the left all "Industry Category" items, and on the right the desired "Parent Category" I'd like to apply to the dataset.

Thanks!

So are you saying you have 2 csvs, one like your first one and another that maps the industry Category with the parent category? — EdChum, Mar 23 '15 at 14:45
Assuming you can get your data into a dict format then it's a dupe of http://stackoverflow.com/q/20250771/3005188 — Ffisegydd, Mar 23 '15 at 14:46
@Ffisegydd would you expect `replace` to be faster than `map`? If the second csv was just a lookup and you set the index to be the 'Industry Category', I would expect `map` to be faster — EdChum, Mar 23 '15 at 14:48
Well, just assume I have a DataFrame and want to create a new column that assigns a Parent Category to the values in "Industry Category". But I guess the _di_ is going in the right direction. — Christopher, Mar 23 '15 at 14:49
@EdChum I have no real grasp on the relative time differences between `map` and `replace` unfortunately. — Ffisegydd, Mar 23 '15 at 14:52
I think I got the idea. I have to create a csv and apply the 2nd step of this [link](http://stackoverflow.com/questions/23057219/how-to-convert-csv-to-dictionary-using-pandas). Once I have the dict, I have to map the df with .map(category_list.get) — Christopher, Mar 23 '15 at 14:54
@Christopher I think you can either create a dict or just create a df, but in the latter case the index needs to be the 'Industry Category'. You can use `map` only if the keys are unique, which is always true for a dict but, for a df, must also hold for your csv data. I expect `map` to be the fastest method, based on personal experience — EdChum, Mar 23 '15 at 15:17

score 1 · Answer 1 · answered Mar 23 '15 at 15:36

I do this quite a lot. I would create a dictionary and use apply and lambda.

example_dict = {'Software/Industry Systems':'Software','Software/Payment':'Software'}

dataframe['Parent Category'] = dataframe['Industry Category'].apply(lambda value: example_dict[value])
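For the two-column csv mentioned in the question, the same idea can be sketched with `map`, as suggested in the comments. This is only a minimal sketch under assumptions: the csv content is faked in memory with `io.StringIO`, and the names `lookup_csv`, `lookup`, `parent_by_industry` and `unmapped` are made up for illustration. It also assumes each "Industry Category" appears only once in the lookup.

```python
import io

import pandas as pd

# Hypothetical stand-in for the two-column lookup csv on disk.
lookup_csv = io.StringIO(
    "Industry Category,Parent Category\n"
    "Software/Industry Systems,Software\n"
    "Software/Payment,Software\n"
    "Electronic Components,Hardware\n"
    "Mobile Phones,Hardware\n"
)

lookup = pd.read_csv(lookup_csv)

# Index the lookup by the key column so it becomes a Series usable with map.
# The keys must be unique for this to work.
parent_by_industry = lookup.set_index("Industry Category")["Parent Category"]

# Small example frame; "Online Communities" is deliberately absent from the lookup.
dataframe = pd.DataFrame(
    {"Industry Category": ["Software/Payment", "Mobile Phones", "Online Communities"]}
)
dataframe["Parent Category"] = dataframe["Industry Category"].map(parent_by_industry)

# Values missing from the lookup end up as NaN, so they can be listed and fixed.
unmapped = dataframe.loc[
    dataframe["Parent Category"].isna(), "Industry Category"
].tolist()
```

Unlike the `example_dict[value]` lookup above, which raises a `KeyError` for any category that is not in the dictionary, `map` leaves those rows as NaN. With about 600 items that makes it easier to spot entries that are missing from the csv or spelled differently (for example, with trailing spaces).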
