
I am having an issue with pandas when writing to a CSV file. When I run the Python script, I either run out of memory or my computer runs slowly after the script has finished. Is there any way to split the data into chunks and write each chunk to the CSV? I am a bit new to programming in Python.

import itertools, hashlib, pandas as pd, time
chars = ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']
numbers_list = list(range(0, 25))
chunksize = 1_000_000
rows = []
for combination in itertools.combinations_with_replacement(chars, 10):
    for A in numbers_list:
        pure = str(A) + ':' + str(combination)
        B = pure.replace(")", "").replace("(", "").replace("'", "").replace(",", "").replace(" ", "")
        C = hashlib.sha256(B.encode('utf-8')).hexdigest()
        rows.append([A, B, C])
t0 = time.time()
df = pd.DataFrame(data=rows, columns=['A', 'B', 'C'])
df.to_csv('data.csv', index=False)
tdelta = time.time() - t0
print(tdelta)

I would really appreciate the help! Thank you!

imxitiz
Be aware that the number of combinations grows very quickly, so your script will be slow. Also, you don't want a huge `rows` variable to consume all your memory, so write it to the file frequently and then reset `rows`. This can be done inside the for loop. To write to a CSV in chunks, follow this [answer](https://stackoverflow.com/a/38531304/15879103). – StandardIO Nov 06 '22 at 02:18
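To put a number on the growth mentioned in the comment, the row count can be worked out before generating anything. The sketch below relies on the standard identity that choosing r items from n with replacement gives C(n + r - 1, r) combinations; the helper name `count_rows` is made up for illustration and is not part of the original script.

```python
import math

def count_rows(alphabet_size: int, length: int, prefixes: int) -> int:
    """Rows produced when every prefix is paired with every combination.

    Hypothetical helper: combinations_with_replacement over n symbols,
    taken r at a time, yields C(n + r - 1, r) tuples.
    """
    combos = math.comb(alphabet_size + length - 1, length)
    return combos * prefixes

# 16 hex characters, combinations of length 10, 25 values of A
print(count_rows(16, 10, 25))  # 81719000
```

At roughly 80 bytes per CSV line (a 64-character hash plus the other two columns), that many rows comes to a file on the order of several gigabytes, and holding them all as Python lists takes considerably more RAM than that.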

1 Answer


Since you are only using the DataFrame to write to a file, skip it completely. Your script builds the full data set in memory as a Python list and then again as a DataFrame, needlessly eating RAM. The csv module in the standard library lets you write line by line.

import csv, hashlib, itertools

chars = ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']
numbers_list = list(range(0, 25))

with open('test.csv', 'w', newline='') as fileobj:
    writer = csv.writer(fileobj)
    for combination in itertools.combinations_with_replacement(chars, 10):
        # Join the tuple directly instead of stripping punctuation from its repr
        joined = ''.join(combination)
        for A in numbers_list:
            B = f'{A}:{joined}'
            C = hashlib.sha256(B.encode('utf-8')).hexdigest()
            # Each row goes straight to the file; nothing accumulates in a list
            writer.writerow([A, B, C])

This will go fast until you've filled up the RAM cache that fronts your storage, and then will go at whatever speed the OS can get data to disk.
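If you still want the chunked writing the question asks about, for example to hand the file a batch of rows per call, a generator plus `itertools.islice` keeps memory bounded by the chunk size. This is only a sketch: the function names `generate_rows` and `write_in_chunks` are made up for illustration, and the header row is an assumption added to match the `A`, `B`, `C` columns of the DataFrame in the question.

```python
import csv
import hashlib
import itertools

def generate_rows(chars, length, numbers):
    """Yield [A, B, C] rows lazily, one at a time (hypothetical helper)."""
    for combination in itertools.combinations_with_replacement(chars, length):
        joined = ''.join(combination)
        for a in numbers:
            b = f'{a}:{joined}'
            yield [a, b, hashlib.sha256(b.encode('utf-8')).hexdigest()]

def write_in_chunks(path, rows, chunksize):
    """Write rows to path in batches of at most chunksize; return the row count."""
    rows = iter(rows)  # islice must consume a single shared iterator
    total = 0
    with open(path, 'w', newline='') as fileobj:
        writer = csv.writer(fileobj)
        # Assumed header, mirroring the column names used in the question
        writer.writerow(['A', 'B', 'C'])
        while True:
            # Pull at most chunksize rows; an empty list means we are done
            chunk = list(itertools.islice(rows, chunksize))
            if not chunk:
                break
            writer.writerows(chunk)
            total += len(chunk)
    return total
```

For the data in the question, the call would look like `write_in_chunks('data.csv', generate_rows(chars, 10, range(25)), chunksize=1_000_000)`. If pandas is a hard requirement, the same loop could build a DataFrame from each chunk and call `to_csv` with `mode='a'` and `header=False` for every chunk after the first, but that adds overhead without changing the result.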

tdelaney