
I have a script (below) that checks a column of addresses in my dataframe against a column of addresses in another dataframe, to see whether they match and how well they match.

My main dataframe (business_rates.csv) contains about 3 million records, and the reference dataframe (all_food_hygiene_data_clean_up.csv) contains about 10,000 records. I get this error when I run the match:

ERROR: Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I think this is due to running out of memory. Can someone tell me how to resolve exit code 137?

import pandas as pd
from rapidfuzz import process, fuzz
from itertools import islice

import time
from dask import dataframe as dd

# reference data (~10,000 rows) in pandas, main data (~3 million rows) in dask
ref_df = pd.read_csv('all_food_hygiene_data_clean_up.csv')
df = dd.read_csv('business_rates.csv', low_memory=False)

# pull the columns out as plain Python lists
contacts_addresses = list(df.address)
ref_addresses = list(ref_df.ref_address.unique())
post_code = list(ref_df.post_code)


scores_list = []
names = []

start = time.time()
print("start time:", time.ctime(start))

# score the reference addresses against all contact addresses, 1000 at a time
chunk_size = 1000
ref_addr_iter = iter(ref_addresses)
while ref_addr_chunk := list(islice(ref_addr_iter, chunk_size)):

    # full score matrix for this chunk: one row per reference address,
    # one column per contact address
    scores = process.cdist(ref_addr_chunk, contacts_addresses, scorer=fuzz.token_sort_ratio, score_cutoff=0, workers=-1)
    # column index of the best-scoring contact address for each row
    max_scores_idx = scores.argmax(axis=1)

    print('post_code', len(post_code))
    print('max_scores_idx', len(max_scores_idx))

    for ref_addr_idx, score_idx in enumerate(max_scores_idx):
        names.append((ref_addr_chunk[ref_addr_idx], contacts_addresses[score_idx]))
        scores_list.append(scores[ref_addr_idx, score_idx])

end = time.time()
print("end time:", time.ctime(end))

name_dict = dict(names)

match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
scores_df = pd.DataFrame(scores_list)

merged_results_01 = pd.concat([match_df, scores_df], axis=1)
merged_results_01.to_csv('merged_results_01.csv')

merged_results_02 = pd.merge(ref_df, merged_results_01, how='right', on='ref_address')
merged_results_02.to_csv('results.csv', mode='a', index=False)


Kelly Tang
  • Always put the FULL error message (starting at the word "Traceback") in the question (not in comments), as text (not a screenshot or a link to an external portal). There is other useful information in the full error/traceback. – furas Oct 07 '22 at 20:41
  • `SIGKILL` can also mean you pressed `Ctrl+C` – furas Oct 07 '22 at 20:42
  • Thanks, I have added the error message to the title... so I might have pressed Ctrl + C and that ended the process? @furas – Kelly Tang Oct 07 '22 at 20:44
  • @furas do you think it is a memory issue? – Kelly Tang Oct 07 '22 at 20:45
  • This is almost certainly a memory issue. What OS are you on? No, it is not a `Ctrl-C` issue, that's `SIGINT`, signal 2, exit code 130. – Vercingatorix Oct 07 '22 at 20:57
  • @Vercingatorix Monterey – Kelly Tang Oct 07 '22 at 20:59
  • @Vercingatorix is there anything I can do in my code without adding more RAM? – Kelly Tang Oct 07 '22 at 20:59
  • @KellyTang Don't use `list()`. I am thinking about an answer. – Vercingatorix Oct 07 '22 at 21:03
  • @KellyTang Did you add the `list()` calls to try to solve a problem? If so, maybe if we explore that problem we can find a better solution. What happens if you use `df.address` instead of `contacts_addresses`? Maybe try breaking the problem into smaller parts and solving those first? – Vercingatorix Oct 07 '22 at 21:59

1 Answer


The problem is that you are using several `list()` calls, each of which builds a complete list in memory from whatever you pass it, which in this case is millions of records. Lists are expensive. This is causing an out-of-memory condition, and the operating system is killing your process (signal 9) as a result. You need to redesign your algorithm to do this in a "Pythonic", streaming way instead of materialising everything as lists. I am not sure how your algorithm works, so I can't be more specific; you know it better than I do. There may also be a better pandas way to access your data than converting columns to lists, but I'm not familiar enough with pandas to help there.
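As a rough sketch only (not part of the original answer, and with the file and column names simply copied from the question): one way to avoid a large in-memory score matrix is to ask rapidfuzz for just the best match per reference address with `process.extractOne`, so only one best hit has to be kept per query. The trade-off is speed, since `extractOne` does not parallelise across queries the way `cdist(..., workers=-1)` does.

import pandas as pd
from rapidfuzz import process, fuzz

ref_df = pd.read_csv('all_food_hygiene_data_clean_up.csv')
# only load the one column we actually need from the 3-million-row file
contacts = pd.read_csv('business_rates.csv', usecols=['address'])['address']

rows = []
for ref_addr in ref_df['ref_address'].unique():
    # extractOne scans every contact address but keeps only the best hit,
    # so it never allocates a (refs x contacts) score matrix
    match = process.extractOne(ref_addr, contacts, scorer=fuzz.token_sort_ratio)
    if match is not None:
        best_addr, score, _ = match
        rows.append((ref_addr, best_addr, score))

matches = pd.DataFrame(rows, columns=['ref_address', 'matched_address', 'score'])
matches.to_csv('merged_results_01.csv', index=False)

If that turns out to be too slow, the alternative is to keep the `cdist` approach from the question but with a much smaller `chunk_size` (see the comments below this answer).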

One other thing I'd note: you should study iterators more thoroughly. Wrapping an islice() in list() defeats the purpose of the islice(). Iterables let you process data in a memory-friendly way.
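As a tiny, generic illustration of that difference (the file name is just the one from the question; this is not the answerer's code): a plain iterator keeps one item in memory at a time, while list() forces the whole input into memory before anything else runs.

import csv

# streaming: only the current row is in memory at any moment
with open('business_rates.csv', newline='') as f:
    n_rows = sum(1 for row in csv.reader(f))

# materialised: list() holds every row in memory before the count starts
with open('business_rates.csv', newline='') as f:
    rows = list(csv.reader(f))
    n_rows = len(rows)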

Vercingatorix
  • Just as background information: this implementation comes from https://stackoverflow.com/a/73973651/11335032 and requires some sort of Sequence as input for `cdist`, so `islice` is only used to generate chunks of data (precisely to reduce memory usage). Not sure how much RAM @KellyTang has, but with a chunk_size of 1000 the numpy array returned by `cdist` already requires around 10 GB of memory, in addition to the dataset + results – maxbachmann Oct 07 '22 at 23:21
  • I have 8 GB 2133 MHz LPDDR3 – Kelly Tang Oct 08 '22 at 14:48
  • Well then you simply do not have enough memory for this chunk size :) – maxbachmann Oct 11 '22 at 18:29
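
For anyone checking that figure: the memory cost comes from the score matrix `cdist` returns, one score per (reference address, contact address) pair in the chunk. A rough back-of-the-envelope calculation, assuming 4 bytes per score:

chunk_size = 1000          # reference addresses per chunk
n_contacts = 3_000_000     # rows in business_rates.csv
bytes_per_score = 4        # assuming 4-byte float scores

matrix_bytes = chunk_size * n_contacts * bytes_per_score
print(matrix_bytes / 1024**3)   # roughly 11 GiB per chunk, before the data itself

Cutting chunk_size to around 100 shrinks the per-chunk matrix to roughly 1 GiB, which is the kind of adjustment being suggested above for an 8 GB machine.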