
I have a semi-complex for loop which (I guess) has to be applied row by row. I've read the information in e.g. 1, but I cannot wrap my head around how I would build the dictionary using those options. Running the current loop on the dataset (120k rows, with large lists inside each row) runs out of memory.

Can someone give me a pointer / hint on how to make this run without an out-of-memory error killing Python, even though the machine has 100 GB of RAM?

Current simplified code example:

import modin.pandas as pd
import numpy as np
import ray

ray.init()

df = pd.read_csv("file.csv")

dict_res = {}
for index, row in df.iterrows():
    list_items = row['listed_items']
    length = len(list_items)
    # each unordered pair in the row list gets a weight of 1/(length - 1)
    for i in range(0, length):
        for j in range(i+1, length):
            key_str = "{},{}".format(list_items[i], list_items[j])
            if key_str in dict_res:
                dict_res[key_str] += (1/(length-1))
            else:
                dict_res[key_str] = (1/(length-1))

Example df['listed_items'] row entries and the resulting dict_res:

row1 = [100000, 200000, 421563]

row2 = [500, 453100, 442211, ...]


dict_res = {
    "100000,200000": 0.5,
    "100000,421563": 0.5,
    "200000,421563": 0.5,
    ...
}

ADDITION: For simpler testing I provide a file testfile.csv:

prop,items
XY108,"[9929, 102010, 301352, 521008]"
XY109,"[382, 396, 456, 639, 883, 1291, 1333, 1969, 9929, 102010, 11457, 12425, 15770]"

We get to the df used in the example by running:

from collections import Counter
import modin.pandas as pd
import ray

ray.init()

df = pd.read_csv("testfile.csv")

def str_to_list(list_str):
    return [int(x) for x in list_str.strip('[]').split(',')]

df['items'] = df['items'].apply(str_to_list)
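
An alternative parse, not from the original post: ast.literal_eval from the standard library turns the bracketed strings into lists directly. A minimal sketch, assuming plain pandas and the same testfile.csv:

import ast
import pandas as pd

df = pd.read_csv("testfile.csv")
# literal_eval parses "[9929, 102010, ...]" into a Python list of ints
df['items'] = df['items'].apply(ast.literal_eval)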

1 Answer


You don't have to iterate over the rows for this calculation. You can use apply to transform each list of items into a Series of pair counts weighted by 1/(len(items) - 1), then sum those Series.

from collections import Counter
import pandas as pd

df = pd.DataFrame(
    {'listed_items': [
        ['a', 'b', 'c'],
        ['d', 'e'],
        ['a', 'b']
    ]
    }
)

def items_to_weighted_sums(items: list) -> pd.Series:
    # count every unordered pair in the list, then weight each count by 1/(len(items) - 1)
    counts = Counter()
    for i in range(0, len(items)):
        for j in range(i + 1, len(items)):
            counts[f"{items[i]}_{items[j]}"] += 1
    return pd.Series(counts) / (len(items) - 1)

# prints 
# {'a_b': 1.5, 'a_c': 0.5, 'b_c': 0.5, 'd_e': 1.0}
print(df['listed_items'].apply(items_to_weighted_sums).sum().to_dict())
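
For reference, the same function can be run end-to-end on the testfile.csv from the question. This is only a hedged usage sketch, assuming plain pandas (the comments below report NaNs with modin but correct sums with pandas) and the items_to_weighted_sums function defined above:

import pandas as pd

df_test = pd.read_csv("testfile.csv")
# parse the bracketed strings into Python lists, as in the question
df_test['items'] = df_test['items'].apply(
    lambda s: [int(x) for x in s.strip('[]').split(',')]
)
# pairs occurring in both rows accumulate weight from each row, e.g.
# "9929_102010" gets 1/3 from the first row and 1/12 from the second
print(df_test['items'].apply(items_to_weighted_sums).sum().to_dict())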
  • Thanks for your answer! Actually, applying some of the already optimized pandas functions for this would have been my first guess too. However, when running it on my reduced df version I get a dict whose keys are similar if not identical to mine, but the resulting weight is NaN for all occurrences. I currently cannot figure out why this happens, as the behaviour of the .sum() on the .apply()-returned series is a mystery to me. There are no keys in the series returned by .apply(). – Ranger Mar 29 '23 at 10:19
  • @Ranger do you have a reproducer for the `NaN` behavior? – Mahesh Vashishtha Mar 29 '23 at 10:43
  • Yes, I added it to the main post. Just paste the .csv and read it; that should emulate the exact same environment. The "correct" version of the script returns the same number of keys, but your algorithm just lists NaN for all but one key (and I cannot figure out why). – Ranger Mar 29 '23 at 12:58
  • @Ranger have you tried my code with pandas instead of modin? It is working for me with pandas 1.5.3, even with your CSV. – Mahesh Vashishtha Mar 29 '23 at 15:18
  • You are right! When using pandas, there is no problem computing the numbers. So much for modin offering a "drop-in" replacement for pandas. Seems weird though; why would the other library return something else... – Ranger Mar 29 '23 at 15:45
  • Will try this at scale now in order to see if a basic pandas dataframe will do the trick or not. – Ranger Mar 29 '23 at 15:45
  • Unfortunately, it does not do the trick. The dataset is at 125k rows with 2 million entries in total across df['items']. I tried to split the dataframe into batches and ran the processing through ProcessPoolExecutor. I still get some OOM errors, and processing with a small batch size and few parallel processes seems tedious... I implemented it so that when a batch finishes, it saves the resulting df to pickle and clears the result from memory with gc.collect(). Any other programmatic ideas for this? Help appreciated! – Ranger Mar 31 '23 at 12:11
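
A minimal sketch of the kind of chunked, Counter-merging approach hinted at in the last comment; the file name, the column name ('items', as in testfile.csv) and the chunk size are assumptions, and it presumes the full set of distinct pairs fits into one Counter. Unlike apply + sum, it never materializes a wide per-row DataFrame of pair columns:

from collections import Counter
import pandas as pd

totals = Counter()
# read the CSV in chunks so only one slice of rows is in memory at a time
for chunk in pd.read_csv("file.csv", chunksize=10_000):
    for list_str in chunk['items']:
        items = [int(x) for x in list_str.strip('[]').split(',')]
        weight = 1 / (len(items) - 1)
        # accumulate weighted pair counts directly into a single Counter
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                totals[f"{items[i]},{items[j]}"] += weight

dict_res = dict(totals)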