I have 2 lists:
- customer_ids
- recommendations (list of list with each list having 6000 shop_ids)
Each list in recommendations
represents shops-recommended against customers in customer_ids
.
I have to filter out 20 shop_ids based on the shops in customer's city only.
Desired output:
- recommendations- (list of list with each list having 20 shop_ids)
customer_ids = ['1','2','3',...]
recommendations = [['110','589','865'...], ['422','378','224'...],['198','974','546'...]]
Filter: shop's city == customer's city.
to extract the city for customers and shops I have 2 sql query:
df_cust_city = pd.read_sql_query("SELECT id, city_id FROM customer_table")
df_shop_city = pd.read_sql_query("SELECT shop_id, city FROM shop_table")
Code using list
filtered_list = []
for cust_id, shop_id in zip(customer_ids, recommendations):
cust_city = df_cust_city.loc[df_cust_city['id'] == cust_id, 'city_id'].iloc[0] #get customer city
df_city_filter = (df_shop_city.where(df_shop_city['city'] == cust_city)).dropna() #get all shops in customer city
df_city_filter = df_city_filter.astype(int)
filter_shop = df_city_filter['shop_id'].astype(str).values.tolist() #make a list of shop_ids in customer city
filtered = [x for x in shop_id if x in filter_rest] #filter recommended shop_ids based on city-filtered list
shop_filtered = list(islice(filtered, 20))
filtered_list.append(shop_filtered) #create recommendation list of lists with only 20 filtered shop_ids
Code using pandas
filtered_list = []
for cust_id, shop_id in zip(customer_ids, recommendations):
cust_city = df_cust_city.loc[df_cust_city['id'] == cust_id, 'city_id'].iloc[0] #get customer city
df_city_filter = (df_shop_city.where(df_shop_city['city'] == cust_city)).dropna()
recommended_shop = pd.DataFrame(shop_id, columns=['id'])
recommended_shop['id'] = recommended_shop['id'].astype(int)
shop_city_filter = pd.DataFrame(df_city_filter['id'].astype(int))
shops_common = recommended_shop.merge(shop_id, how='inner', on='id')
shops_common.drop_duplicates(subset="id", keep=False, inplace=True)
filtered = shops_common.head(20)
shop_filtered = filtered['id'].values.tolist()
filtered_list.append(shop_filtered)
Time taken for complete for loop to run:
using list: ~8000 seconds
using pandas: ~3000 seconds
I have to run the for loop 22 times.
is there a way to completely get rid of the for loop? Any tips/pointers on how to achieve this so it takes less time for 50000 customers at once. I am trying it out with dictionary.
df_cust_city:
id city_id
00919245 1
02220205 2
02221669 2
02223750 2
02304202 2
df_shop_city:
shop_id city
28 1
29 1
30 1
31 1
32 1