I have a dataset of product purchases logs with 6 columns: purchase_date, user_address, user_id, product_id, brand_id, retailer_id. All contain integers, except user_address which is a string.
I need to get the top 5 brands selling the most items over the whole dataset, i.e. the ones having the most entries in the data.
In SQL, I believe it would look like the following (correct me if I'm wrong):
SELECT brand_id, COUNT(*)
FROM data
GROUP BY brand_id
I've tried doing that in python with a numpy recarray as follows:
items_sold_per_brand = np.empty(len(data), dtype=[('brand_id', 'int'), ('count', 'int')])
brands = np.unique(data['brand_id']) # array of unique brands
for i, brand in enumerate(np.nditer(brands)): # For any unique brand
items_sold_per_brand[i] = (brand, len(data[data['brand_id'] == brand])) # get the number of rows with the current brand
top5 = np.sort(items_sold_per_brand, order='count')[-5:] # sort the array over the count values
print(top5[::-1]) # print the last five entries
It works, except it is taking around 15 seconds to run on a ~100000 rows dataset, with roughly 12000 different brands, which seems too long. The for loop is what takes the longest.
Is there a more elegant and efficient way to do this, by maybe using numpy's recarray querying methods?
Thanks for your help!