Removing for loop - using dictionary instead of pandas

Question

I have 2 lists:

customer_ids
recommendations (list of list with each list having 6000 shop_ids)

Each list in recommendations represents shops-recommended against customers in customer_ids. I have to filter out 20 shop_ids based on the shops in customer's city only.

Desired output:

recommendations- (list of list with each list having 20 shop_ids)

customer_ids = ['1','2','3',...]
recommendations = [['110','589','865'...], ['422','378','224'...],['198','974','546'...]]

Filter: shop's city == customer's city.

to extract the city for customers and shops I have 2 sql query:

df_cust_city = pd.read_sql_query("SELECT id, city_id FROM customer_table")
df_shop_city = pd.read_sql_query("SELECT shop_id, city FROM shop_table")

Code using list

filtered_list = []
for cust_id, shop_id in zip(customer_ids, recommendations):
        cust_city = df_cust_city.loc[df_cust_city['id'] == cust_id, 'city_id'].iloc[0] #get customer city
        df_city_filter = (df_shop_city.where(df_shop_city['city'] == cust_city)).dropna() #get all shops in customer city 
        df_city_filter = df_city_filter.astype(int)
        filter_shop = df_city_filter['shop_id'].astype(str).values.tolist() #make a list of shop_ids in customer city
        filtered = [x for x in shop_id if x in filter_rest] #filter recommended shop_ids based on city-filtered list
        shop_filtered = list(islice(filtered, 20))
filtered_list.append(shop_filtered) #create recommendation list of lists with only 20 filtered shop_ids

Code using pandas

filtered_list = []
for cust_id, shop_id in zip(customer_ids, recommendations):
        cust_city = df_cust_city.loc[df_cust_city['id'] == cust_id, 'city_id'].iloc[0] #get customer city
        df_city_filter = (df_shop_city.where(df_shop_city['city'] == cust_city)).dropna()
        recommended_shop = pd.DataFrame(shop_id, columns=['id'])
        recommended_shop['id'] = recommended_shop['id'].astype(int)
        shop_city_filter = pd.DataFrame(df_city_filter['id'].astype(int))
        shops_common = recommended_shop.merge(shop_id, how='inner', on='id')
        shops_common.drop_duplicates(subset="id", keep=False, inplace=True)
        filtered = shops_common.head(20)
        shop_filtered = filtered['id'].values.tolist()
filtered_list.append(shop_filtered)

Time taken for complete for loop to run:

using list: ~8000 seconds

using pandas: ~3000 seconds

I have to run the for loop 22 times.

is there a way to completely get rid of the for loop? Any tips/pointers on how to achieve this so it takes less time for 50000 customers at once. I am trying it out with dictionary.

df_cust_city:

id        city_id
00919245    1
02220205    2
02221669    2
02223750    2
02304202    2


df_shop_city:

shop_id city
28        1
29        1
30        1
31        1
32        1

score 1 · Answer 1 · answered Nov 28 '21 at 08:07

1

This will not get rid of the for loop, but how about you group customers by city first?

That way, the operations leading to filter_shop will only have to be performed N_cities times, rather than N_customers. In addition, the computation of the filtered variable might be significantly faster using a set difference.

answered Nov 28 '21 at 08:07

bizzinho

53
5

There's a 0.002 second difference when using set but order is not maintained when using sets. – afs Nov 28 '21 at 11:57

Removing for loop - using dictionary instead of pandas

1 Answers1