I have data in the following format (stored in a pandas DataFrame), essentially a normalised mapping of categories and wares to a slug:
pandas.DataFrame:
                   categories           slug                                              wares
0       [developer, mac, web]     alex.payne  [macbook-pro, cinema-display, readynas-nv-plus...
1             [mac, musician]  jona.bechtolt  [audio-kontrol-1, powershot-sd1000, live, mda-...
2       [game, suit, windows]    gabe.newell  [oa-desk, beyond-tv, windows-xp, office, visua...
3  [developer, mac, software]   steven.frank  [mac-pro, macbook-air, apple-tv, itunes, addre...
My intention is to plot graphs of categories correlated with wares, so I need the data in a denormalised format, something like this:
  categories             wares        slug
0  developer       macbook-pro  alex.payne
1        mac       macbook-pro  alex.payne
2        web       macbook-pro  alex.payne
3  developer    cinema-display  alex.payne
4        mac    cinema-display  alex.payne
5        web    cinema-display  alex.payne
6  developer  readynas-nv-plus  alex.payne
What is the best way to convert the data from the first format to the second, preferably one that leverages numpy internals so that it is fast?
My approach was a rather naive one: loop through each row in the data frame, build up a list of tuples, and then pass that list to the pandas.DataFrame constructor. Any suggestion of yours will probably end up being faster and better, so suggest away!
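For reference, a minimal sketch of that naive approach. It assumes a small toy frame shortened from the data above (the real lists are longer), and the variable names are illustrative rather than taken from the actual notebook:

```python
import itertools

import pandas as pd

# Toy data shaped like the frame above; lists are shortened for illustration.
df = pd.DataFrame({
    "categories": [["developer", "mac", "web"], ["mac", "musician"]],
    "slug": ["alex.payne", "jona.bechtolt"],
    "wares": [["macbook-pro", "cinema-display"], ["audio-kontrol-1", "live"]],
})

rows = []
for categories, slug, wares in zip(df["categories"], df["slug"], df["wares"]):
    # Cross every ware with every category for this slug.
    # Wares vary slowest and categories fastest, matching the desired output order.
    for ware, category in itertools.product(wares, categories):
        rows.append((category, ware, slug))

# One row per (category, ware, slug) combination.
denormalised = pd.DataFrame(rows, columns=["categories", "wares", "slug"])
print(denormalised)
```

This produces one row per category/ware pair for each slug, but does all the work in a Python-level loop, which is the part I would like to avoid.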
I am also thinking about alternative representations of such data in a pandas DataFrame, specifically a sparse matrix, though I suspect the denormalised form would be better for groupby queries in particular. If there are other formats, or if a sparse matrix works better for such aggregation queries, please suggest how to go about it.
Here is the entire thing, for those interested: http://j.mp/lp-usesthis. I ended up not doing the denormalisation the way I originally intended; instead I looped over the column of interest only. But a better way to denormalise would improve it.