
I have a dataset of product purchase logs with 6 columns: purchase_date, user_address, user_id, product_id, brand_id, retailer_id. All columns contain integers, except user_address, which is a string.

I need to get the top 5 brands selling the most items over the whole dataset, i.e. the ones having the most entries in the data.

In SQL, I believe it would look like the following (correct me if I'm wrong):

SELECT brand_id, COUNT(*) AS cnt
FROM data
GROUP BY brand_id
ORDER BY cnt DESC
LIMIT 5

I've tried doing that in python with a numpy recarray as follows:

items_sold_per_brand = np.empty(len(data), dtype=[('brand_id', 'int'), ('count', 'int')])

brands = np.unique(data['brand_id'])    # array of unique brands
for i, brand in enumerate(np.nditer(brands)):     # For any unique brand
    items_sold_per_brand[i] = (brand, len(data[data['brand_id'] == brand]))    # get the number of rows with the current brand

top5 = np.sort(items_sold_per_brand, order='count')[-5:]    # sort the array over the count values
print(top5[::-1])    # print the last five entries

It works, except it takes around 15 seconds to run on a ~100,000-row dataset with roughly 12,000 different brands, which seems too long. The for loop is what takes the longest.

Is there a more elegant and efficient way to do this, by maybe using numpy's recarray querying methods?

Thanks for your help!

1 Answer


While the proposed duplicate numpy: most efficient frequency counts for unique values in an array is relevant, I think it ignores some important issues in this code. bincount, the accepted answer there, probably isn't useful here, and you may need some help applying the newer unique with return_counts answer.

A test script:

import numpy as np
# simple data array
data=np.zeros((20,),dtype=[('brand_id',int),('id',int)])
data['brand_id']=np.random.randint(0,10,20)
data['id']=np.arange(20)

# the question's original approach, reproduced as posted
items_sold_per_brand = np.empty(len(data), dtype=[('brand_id', 'int'), ('count', 'int')])
brands = np.unique(data['brand_id']) 
print('brands',brands)
for i, brand in enumerate(np.nditer(brands)):
    items_sold_per_brand[i] = (brand, len(data[data['brand_id'] == brand]))
top5 = np.sort(items_sold_per_brand, order='count')[-5:]    
print('top5',top5[::-1]) 

# a bit of simplification
brandids = data['brand_id']
brands = np.unique(brandids)
# only need space for the unique ids 
items_sold_per_brand = np.zeros(len(brands), dtype=[('brand_id', 'int'), ('count', 'int')])
items_sold_per_brand['brand_id'] = brands
for i, brand in enumerate(brands):  # dont need nditer
    items_sold_per_brand['count'][i] = (brandids == brand).sum()
top5 = np.sort(items_sold_per_brand, order='count')[-5:]    
print('top5',top5[::-1])    

# unique with return_counts does all the counting in one call
brands,counts = np.unique(data['brand_id'],return_counts=True)
print('counts',counts)

items_sold_per_brand = np.empty(len(brands), dtype=[('brand_id', 'int'), ('count', 'int')])
items_sold_per_brand['brand_id']=brands
items_sold_per_brand['count']=counts
tops = np.sort(items_sold_per_brand, order='count')[::-1]
print('tops',tops)

# bincount counts every integer from 0 up to the largest brand_id
ii = np.bincount(data['brand_id'])
print('bin',ii)

produces

1030:~/mypy$ python3 stack38091849.py 
brands [0 2 3 4 5 6 7 9]
top5 [(99072, 1694566490) (681217, 1510016618) (1694566234, 1180958979)
 (147063168, 147007976) (-1225886932, 139383040)]
top5 [(7, 4) (2, 4) (0, 3) (9, 2) (6, 2)]
counts [3 4 2 1 2 2 4 2]
tops [(7, 4) (2, 4) (0, 3) (9, 2) (6, 2) (5, 2) (3, 2) (4, 1)]
bin [3 0 4 2 1 2 2 4 0 2]

Initializing items_sold_per_brand with empty, and with the size of data rather than of brands, leaves uninitialized values in the slots that never get overwritten during the iteration; that is where the garbage in the first top5 output comes from. zeros with the smaller brands size takes care of that.
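
A minimal sketch of that pitfall in isolation (the sizes and values are made up for illustration):

import numpy as np

n_rows, n_brands = 10, 3   # hypothetical sizes
arr = np.empty(n_rows, dtype=[('brand_id', int), ('count', int)])   # uninitialized memory
arr[:n_brands] = [(0, 5), (1, 2), (2, 7)]   # only the first 3 slots get written
# the remaining 7 slots hold whatever was in memory and can dominate the sort
print(np.sort(arr, order='count')[-5:])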

nditer isn't needed for a simple iteration like this; a plain enumerate over the 1-d brands array does the same job.

bincount is fast, but it creates a bin for every integer from 0 up to the largest value in the data, so there can be many zero-size bins (the bin line above has bins for values like 1 and 8 that never occur).
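
That said, if the brand ids are non-negative integers of modest size, bincount plus argsort can still give the top 5 without building a structured array; a sketch reusing the data array from the script above (watch out for zero-count bins sneaking in when fewer than 5 brands actually occur):

counts = np.bincount(data['brand_id'])   # counts[b] is the number of rows with brand_id b
top = np.argsort(counts)[::-1][:5]       # the 5 brand ids with the largest counts
print(list(zip(top, counts[top])))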
