I loaded a data file (34 million lines, one sentence per line) into memory, which took about 4 GB on my laptop.
While preprocessing it, memory usage grew by another 1.5 GB after only 2 million sentences had been processed.
import re
import gc

count = 0
for line in lines:
    # Replace the raw string with its token list, in place.
    lines[count] = re.findall(r"[\w']+|[().,:!?;'$&]", line)
    count += 1
    if count % 100000 == 0:
        print(count)
        gc.collect()
Can somebody explain why this happens and how to optimize it?
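
For reference, this is the kind of generator-based variant I was wondering about as an alternative: it tokenizes each line lazily so the raw strings and the token lists never have to sit in the same list at once. Untested sketch; "sentences.txt" is a placeholder path and the consuming loop is just for illustration.

import re

# Compile the pattern once instead of re-parsing it for every line.
TOKEN_RE = re.compile(r"[\w']+|[().,:!?;'$&]")

def tokenized_lines(path):
    """Yield one token list per line, reading the file lazily."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield TOKEN_RE.findall(line)

# Placeholder usage: consume the generator instead of keeping all lines in memory.
for count, tokens in enumerate(tokenized_lines("sentences.txt"), start=1):
    if count % 100000 == 0:
        print(count)

Would something like this avoid the growth I'm seeing, or is the extra memory coming from somewhere else (e.g. the token lists being larger than the original strings)?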