Here's the situation: I have a DataFrame that is basically a product catalogue. It contains two important columns: a product ID and a 12-digit category number. Below is some sample data; the original data contains many more products, more columns, and many different categories.
import pandas as pd

# Categories are stored as strings so that prefix slicing works later on.
products = [
    {'category': '110401010601', 'product': 1000023},
    {'category': '110401020601', 'product': 1000024},
    {'category': '110401030601', 'product': 1000025},
    {'category': '110401040601', 'product': 1000026},
    {'category': '110401050601', 'product': 1000027}]
df = pd.DataFrame.from_records(products)
The task is to use the 12-digit category number to form parent categories and to count, for each parent, the number of products matching it. The parents are formed by truncating the category in 2-digit steps. The counts per parent are later used to find, for each product, a parent that has a minimum number of records (let's say 12 children). Of course, the shorter the prefix gets, the more products will match it. Here's an example parent structure:
110401050601 # product category
1104010506 # 1st parent
11040105 # 2nd parent
110401 # 3rd parent
1104 # 4th parent
11 # 5th super-parent
You can see that many more products may match, for instance, 1104 than the full 110401050601.
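For illustration, the parents can be derived by simple string slicing (a minimal sketch, assuming the category is stored as a string):

cat = '110401050601'
parents = [cat[:i] for i in range(10, 1, -2)]
# ['1104010506', '11040105', '110401', '1104', '11']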
Idea 1 for Small Data: As long as you have small or medium-sized data fully loaded into a Pandas DataFrame, this is an easy task. I solved it with the code below. The disadvantage is that it assumes all data is in memory, and each loop iteration is another select on the full DataFrame, which is bad for performance. Example: for 100,000 rows and 6 parent groups (formed from the 12 digits) you may end up with 600,000 selects via DataFrame.loc[...] in the worst case. To prevent this, I break the inner loop as soon as a parent has been seen before. Remark: df.shape[0] is equivalent to len(df).
MIN_COUNT = 12  # minimum number of children a parent must have

df = df.drop_duplicates()
categories = df['category'].unique()
counts = dict()
for cat in categories:
    # Count the products in the full 12-digit category itself.
    counts[cat] = df.loc[df['category'] == cat].shape[0]
    # Walk up the hierarchy: prefixes of length 10, 8, 6, 4, 2.
    for i in range(10, 1, -2):
        parent = cat[:i]
        if parent not in counts:
            counts[parent] = df.loc[df['category'].str.startswith(parent)].shape[0]
        else:
            # All shorter parents of this prefix were counted already.
            break

counts = {key: value for key, value in counts.items() if value >= MIN_COUNT}
This results in something like the following (using parts of my original data):
{'11': 100,
'1103': 7,
'110302': 7,
'11030202': 7,
'1103020203': 7,
'110302020301': 7,
'1104': 44,
'110401': 15,
'11040101': 15,
'1104010106': 15,
'110401010601': 15}
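(For context, the later lookup step could then walk a product's parents from longest to shortest and take the first prefix that reaches the threshold. The find_parent function below is a hypothetical sketch of that step, not part of the two solutions under review:)

def find_parent(cat, counts, min_count=12):
    # Walk from the full 12-digit category down to the 2-digit
    # super-parent and return the first prefix with enough children.
    for i in range(12, 1, -2):
        parent = cat[:i]
        if counts.get(parent, 0) >= min_count:
            return parent
    return None  # no parent reaches the threshold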
Idea 2 for Big Data using flatmap-reduce: Now imagine you have much more data, loaded row-wise, and you want to achieve the same result as above. I was thinking of using flatmap to split the category number into its parents (one to many), emitting a 1-counter for each parent, and then applying groupby-key to get the count for all possible parents. The advantage of this version is that it doesn't need all the data at once and that it doesn't perform any selects on the DataFrame. But in the flatmap step the number of rows increases by a factor of 6 (the 12-digit category number is split into 6 groups). Since Pandas has no flatten/flatmap method, I had to apply a work-around using unstack (for an explanation see this post).
df = df.drop_duplicates()
# Flatmap: emit a (prefix, 1) pair for the full category and each parent
# (prefix lengths 12, 10, 8, 6, 4, 2), i.e. six rows per product.
counts_stacked = df['category'].apply(lambda cat: [(cat[:i], 1) for i in range(12, 1, -2)])
# Work-around for the missing flatmap: expand the lists into columns,
# then unstack them into one long Series of (prefix, 1) tuples.
counts = counts_stacked.apply(pd.Series).unstack().reset_index(drop=True)
# Reduce: group by prefix and count the 1-counters per group.
df_counts = pd.DataFrame.from_records(list(counts), columns=['category', 'count'])
counts = df_counts.groupby('category').count().to_dict()['count']
counts = {key: value for key, value in counts.items() if value >= MIN_COUNT}
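For comparison, newer Pandas versions (0.25+) ship Series.explode, which expresses the flatmap step directly; here is a minimal sketch of the same idea, where value_counts replaces the groupby-count step:

# Same flatmap-reduce, assuming Pandas >= 0.25: explode flattens the
# per-product prefix lists, value_counts performs the reduce.
prefixes = df['category'].apply(lambda cat: [cat[:i] for i in range(12, 1, -2)])
counts = prefixes.explode().value_counts().to_dict()
counts = {key: value for key, value in counts.items() if value >= MIN_COUNT}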
Question: Both solutions are fine, but I wonder if there is a more elegant way to achieve the same result. I feel that I've missed something.