
I loaded a data file (34 million lines, one sentence per line), which took about 4 GB of memory on my laptop.

While preprocessing, memory usage increased by another 1.5 GB after only 2 million sentences had been processed.

import gc
import re

count = 0
for line in lines:
    # Replace each raw sentence string with its list of tokens, in place.
    lines[count] = re.findall(r"[\w']+|[().,:!?;'$&]", line)
    count += 1
    if count % 100000 == 0:
        print(count)      # progress report every 100,000 lines
        gc.collect()      # explicit garbage collection pass

Can somebody explain why this happens and how to optimize it?

Bing Magic
  • You are storing all the outputs of the findall calls in your list. Naturally it will use a lot of memory with > 2 million lines. Try batch processing to minimize the RAM usage (for example, load the data in chunks and store those chunks on the hard drive) – f.wue Feb 27 '19 at 09:20
  • I really recommend you read this post: [Optimizing Python](https://stackoverflow.com/questions/7165465/optimizing-python-code). But to answer your question: it is normal for this to take that much memory, it is a lot of lines... – R. García Feb 27 '19 at 09:22
  • Each Python string, like every Python object, carries some additional data besides the characters themselves (`sys.getsizeof('')` returns 49 bytes for an empty string). You replace a single string with a list of strings, which can mean a lot of overhead. You should try not to load everything into memory at once. – Thierry Lathuille Feb 27 '19 at 09:25
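To make the overhead described in the last comment concrete, here is a small sketch (the sample sentence is made up, and exact byte counts depend on the Python build and platform) comparing the size of one raw line with the combined size of the token list produced for it:

import re
import sys

line = "I loaded a data file which took a lot of memory on my laptop."
tokens = re.findall(r"[\w']+|[().,:!?;'$&]", line)

# One string object vs. one list object plus one small string object per token.
raw_size = sys.getsizeof(line)
tokenized_size = sys.getsizeof(tokens) + sum(sys.getsizeof(t) for t in tokens)
print(raw_size, tokenized_size)  # the tokenized form is several times larger

Multiplied across 34 million lines, that per-object overhead accounts for the growth seen during preprocessing.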
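And a minimal sketch of the batch processing the first comment suggests, assuming the sentences live in a plain text file (the names `data.txt` and `tokens_batch_*.pkl` are hypothetical): read the file lazily, tokenize a fixed number of lines at a time, and spill each batch to disk, so peak memory stays around one batch instead of all 34 million token lists.

import pickle
import re

TOKEN_RE = re.compile(r"[\w']+|[().,:!?;'$&]")
BATCH_SIZE = 100_000  # lines per batch; tune to the available RAM

def tokenize_batches(path):
    """Yield lists of tokenized lines, BATCH_SIZE at a time,
    without holding the whole file in memory."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file object is already a lazy line iterator
            batch.append(TOKEN_RE.findall(line))
            if len(batch) == BATCH_SIZE:
                yield batch
                batch = []  # drop the reference so the old batch can be freed
    if batch:  # last, partially filled batch
        yield batch

# Hypothetical usage: tokenize "data.txt" and write each batch to disk.
for i, batch in enumerate(tokenize_batches("data.txt")):
    with open(f"tokens_batch_{i}.pkl", "wb") as out:
        pickle.dump(batch, out)

Downstream code can then iterate over the pickled batches one at a time instead of over a single in-memory list.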

0 Answers