
I have a function that finds the common and uncommon items, and their rates, between a given list (one list) and other lists (60,000 lists) for each user (4,000 users). Running the loop below takes too long and uses so much memory that the list is only partially constructed before the program crashes. I think this is due to the long returned list and its heavy elements (tuples), so I split it into two functions as shown below, but the problem seems to be in appending the list items inside the tuples: [(user, [items], rate), (user, [items], rate), ....]. I want to create a DataFrame from the returned values.

What should I do to the algorithm to get around this problem and reduce memory usage?

I am using Python 3.7 on Windows 10 64-bit with 8 GB of RAM.

common items function:

def common_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))
    return user, com_items, com_items_rate

uncommon items function:

def uncommon_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))

    uncom_items = list(set(list2) - set(com_items))  # uncommon items that belong to list2
    uncom_items_rate = len(uncom_items) / len(set(list1).union(set(list2)))

    return user, com_items_rate, uncom_items, uncom_items_rate  # common_items_rate is also needed
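
To make the returned shapes concrete, here is a quick made-up call of both functions (the values are arbitrary, and the element order inside the returned lists may vary since they come from sets):

user = 1
list1 = [2, 5, 8, 9]   # this user's list
list2 = [5, 8, 7, 4]   # one of the 60,000 other lists

print(common_items(user, list1, list2))
# (1, [8, 5], 0.3333333333333333)
print(uncommon_items(user, list1, list2))
# (1, 0.3333333333333333, [4, 7], 0.3333333333333333)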

Constructing the list:

common_item_rate_tuple_list = []

for usr in users:  # users.shape = 4,000
    list1 = get_user_list(usr)  # a function to get list1; it takes 0:00:00.015632 or less for a user
    # print(usr, len(list1))

    for list2 in df["list2"]:  # df.shape = 60,000
        common_item_rate_tuple = common_items(usr, list1, list2)
        common_item_rate_tuple_list.append(common_item_rate_tuple)

print(len(common_item_rate_tuple_list))  # 4,000 * 60,000 = 240,000,000 items
# sample of common_item_rate_tuple_list:
# [(1, [2,5,8], 0.676), (1, [7,4], 0.788), .... (4000, [1,5,7,9], 0.318), (4000, [8,9,6], 0.521)]
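
For completeness, turning the returned tuples into the long-format DataFrame I want would just be the following (the column names are my own choice); building the frame is not the hard part, holding 240,000,000 tuples in memory is:

import pandas as pd

result_df = pd.DataFrame(common_item_rate_tuple_list,
                         columns=["user", "common_items", "common_rate"])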

I looked at (Memory errors and list limits?) and (Memory error when appending to list in Python), but they deal with the already constructed list. And I could not apply the suggested answer from (Python list memory error).

nucsit026
  • Just looking at `common_items`, I see two problems. One, you call `set(list1)` twice (both take the same amount of time, proportional to the size of the list), even though the second one will produce the same `set` value because `list1` hasn't changed. Two, `intersection` can take any iterable; there's no particular benefit to making a `set` out of `list2` first. `x = set(list1); y = x.intersection(list2); return user, x, len(x)/len(x.union(list2))`. – chepner Jun 22 '20 at 14:37
  • @chepner Thanks for your comments I will change it. – nucsit026 Jun 22 '20 at 15:52

1 Answer


There are a couple things you should consider for speed and memory management with data this big.

  • you are, or should be, working only with sets here, because order has no meaning in your lists and you are doing a lot of intersecting of sets. So, can you change your `get_user_list()` function to return a set instead of a list? That will prevent all of the unnecessary conversions you are doing. Same for `list2`: just make it a set right away
  • when looking for "uncommon items", you should just use the symmetric difference operator on the sets. Much faster, and far fewer list -> set conversions (a sketch of these first two points follows this list)
  • at the end of your loop, do you really want to create a list of 240M sub-lists? That is probably your memory explosion. I would suggest a dictionary keyed by user name, and you only need to create an entry in it if there are common items. If the matches are "sparse", you will get a very much smaller data container
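
For example, here is a rough sketch of the first two points combined (my sketch, not the original functions; it assumes `get_user_list()` is changed so that `set1` and `set2` arrive as sets already):

def compare_sets(user, set1, set2):
    # set1: the user's items as a set; set2: one of the "other lists", already a set
    union_size = len(set1 | set2)
    common = set1 & set2        # intersection, no list <-> set conversions
    # the full symmetric difference (items in exactly one of the two sets) would be set1 ^ set2;
    # here only the items unique to set2 are kept, matching the original uncommon_items
    uncommon_in_2 = set2 - set1
    return (user, common, len(common) / union_size,
            uncommon_in_2, len(uncommon_in_2) / union_size)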

--- Edit w/ example

So I think your hope of keeping this in a data frame is too big. Perhaps you can do what is needed without storing it in a data frame at all. A dictionary makes sense. You may even be able to compute things "on the fly" and not store the data. Anyhow, here is a toy example that shows the memory problem using 4K users and only 10K "other lists". Of course the size of the intersected sets may make this vary, but it is informative:

import sys
import pandas as pd

# create list of users by index
users = list(range(4000))

match_data = list()

size_list2 = 10_000

for user in users:
    for t in range(size_list2):
        match_data.append((user, (1, 5, 6, 9), 0.55))   # 4 dummy matches and a fake percentage

print(match_data[:4])
print(f'size of match: {sys.getsizeof(match_data)/1_000_000} MB')

df = pd.DataFrame(match_data)

print(df.head())

print(f'size of dataframe {sys.getsizeof(df)/1_000_000} MB')

This yields the following:

[(0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55)]
size of match: 335.072536 MB
   0             1     2
0  0  (1, 5, 6, 9)  0.55
1  0  (1, 5, 6, 9)  0.55
2  0  (1, 5, 6, 9)  0.55
3  0  (1, 5, 6, 9)  0.55
4  0  (1, 5, 6, 9)  0.55
size of dataframe 3200.00016 MB

You can see that even this nutshell version of your idea, with only 10K other lists, is 3.2 GB in a dataframe. That will be unmanageable.

Here is an idea for a data structure just to use dictionaries all the way.

del df

# just keep it in a dictionary
data = {}   # intended format:  key= (usr, other_list) : value= [common elements]

# some fake data
user_items = {  1: {2,3,5,7,99},
                2: {3,5,88,790},
                3: {2,4,100} }

# some fake "list 2 data"
list2 = [   {1,2,3,4,5},
            {88, 100},
            {11, 13, 200}]

for user in user_items.keys():
    for idx, other_set in enumerate(list2):     # using enumerate to get the index of the other list
        common_elements = user_items.get(user) & other_set   # set intersection
        if common_elements:  # only put it into the dictionary if it is not empty
            data[(user, idx)] = common_elements

# try a couple data pulls
print(f'for user 1 and other list 0: {data.get((1, 0))}')
print(f'for user 2 and other list 2: {data.get((2, 2))}')  # use .get() to be safe.  It will return None if no entry

The output here is:

for user 1 and other list 0: {2, 3, 5}
for user 2 and other list 2: None
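
If you also need the rate (as in your original tuples), one variation of the same idea is to store a small tuple as the dictionary value. A sketch reusing the fake data above:

# same loop, but keep (common elements, rate) together as the value
data_with_rate = {}
for user, user_set in user_items.items():
    for idx, other_set in enumerate(list2):
        common_elements = user_set & other_set
        if common_elements:
            rate = len(common_elements) / len(user_set | other_set)
            data_with_rate[(user, idx)] = (common_elements, rate)

print(data_with_rate.get((1, 0)))   # roughly ({2, 3, 5}, 0.4286) with the fake data above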

Your other alternative, if you are going to be working with this data a lot, is to put these tables into a database such as SQLite, which is built into Python and won't bomb out your memory.
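
As a rough sketch (the table and column names are made up), the dictionary from the example above could be pushed into a local SQLite file with the built-in `sqlite3` module and then queried per user, instead of holding everything in RAM:

import sqlite3

conn = sqlite3.connect("matches.db")
conn.execute("""CREATE TABLE IF NOT EXISTS matches
                (user INTEGER, list_idx INTEGER, common_items TEXT)""")

# store the (user, other-list index) -> common elements dictionary from above
rows = [(user, idx, ",".join(map(str, sorted(common))))
        for (user, idx), common in data.items()]
conn.executemany("INSERT INTO matches VALUES (?, ?, ?)", rows)
conn.commit()

# pull back only what is needed, e.g. one user's matches
for row in conn.execute("SELECT * FROM matches WHERE user = ?", (1,)):
    print(row)          # e.g. (1, 0, '2,3,5')
conn.close()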

AirSquid
  • (point 3) This is the main issue and the reason I posted the question; I referred to this suggested solution in my last, similar question, but I couldn't understand how to use it in my case. Could you please explain with an example? (point 2) Symmetric difference works on both sets and I only need one side, but I will try it together with (point 1) and check the difference. Still, appending the list items costs memory; I tried appending alone and got the same problems. – nucsit026 Jun 22 '20 at 15:45
  • So is your "end goal" a dataframe with 4000 rows and a column to hold the matches for each of the 60,000 separate lists, so basically a matrix of 4000 x 60,000 with the entries containing the matches as a list or tuple? – AirSquid Jun 22 '20 at 16:37
  • yes, the final list is a list of tuples `[(user, [common_items], common_rate), ....]`, but I prefer it to be in long format, shape (240,000,000 rows, 3 columns), because that is easier for my later operations. Another thing: for your suggestion of using a dictionary, I could not use it because I don't know how to use it with the other related values (user, common_rate). – nucsit026 Jun 22 '20 at 19:36
  • I tried to get more than one session per user in `get_user_list()` instead of one as previously (actually about 300 lists for each user); now the dictionary is getting (60,000 * 4,000 * 300 elements) instead of (60,000 * 4,000 elements), and I have the memory issue again (even if I reduce the number of elements). Are there any suggestions to avoid that? – nucsit026 Jul 07 '20 at 22:56
  • Yes! Switch over to using a database. They are built to handle much larger tables of data and do highly efficient table joins for matches, etc. You can host one locally or for free on AWS or such. – AirSquid Jul 07 '20 at 23:13
  • Did you mean using `SQLite`? I am not familiar with it, but I will google it. Thanks for your kindness. – nucsit026 Jul 07 '20 at 23:24
  • That is where I would start. It is built into python and uses SQL language. Probably some learning curve to get up and running. Best to have the DB hosted on a server, which can be done for free on AWS...again, with some learning curve to get started. – AirSquid Jul 07 '20 at 23:26