I have a 64-bit, 4-core, 2.50 GHz machine with 64 GB of RAM, of which about 13 GB is free. I am trying to read 24 CSV files with around 40 million rows in total using the code below:
import os
import pandas as pd

def test():
    test = pd.DataFrame()
    rootdir = '/XYZ/A'
    # walk the directory tree and append every file to one DataFrame
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            df = pd.read_csv(os.path.join(subdir, file), low_memory=False)
            test = pd.concat([test, df])
    return test
How can I optimize this to run faster without the kernel dying? Should I be implementing this in PySpark instead? Please let me know if I missed any details.
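For reference, here is a minimal sketch of one variant I am considering: collecting the frames in a list and calling pd.concat once at the end instead of concatenating inside the loop. The .csv filename filter is an assumption about my directory contents; I am not sure this alone will keep memory under control.

import os
import pandas as pd

def read_all(rootdir='/XYZ/A'):
    frames = []
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            # assumption: only files ending in .csv should be read
            if not file.endswith('.csv'):
                continue
            frames.append(pd.read_csv(os.path.join(subdir, file)))
    # a single concat avoids repeatedly copying the growing DataFrame
    return pd.concat(frames, ignore_index=True)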