I use Python to find certain patterns in a large csv file (1.2 million lines, 250MB) and modify each line where such a pattern is found. My approach is like this:
dfile = open(csvfile, 'r')
lines = dfile.readlines()
dfile.close()
for i in range(0, len(lines)):
    lines[i] = f(lines[i])   # f(.) modifies the line string if a pattern is found
# then I have code that writes the processed data to another csv file
The problem is that after a certain number of iterations the code stops and raises a MemoryError. My system has 32GB of RAM. How can I improve the memory usage? I tried to read the data line by line with the following approach:
import linecache
outp = open(newfile, 'w')
j = 1
while True:
    line = linecache.getline(csvfile, j)
    if line == '':
        break
    outp.write(f(line))
    j += 1
outp.close()
This approach also failed:
encoding error reading location 0X9b?!
Any solution?
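For clarity, what I am ultimately trying to achieve is roughly the following. This is only a sketch of my intent, not working code: it streams the input so the whole file is never held in memory, and the encoding argument is a guess on my part, since I do not actually know the file's encoding.

# Sketch of the intended processing: read one line at a time and write the
# processed line immediately. 'latin-1' is an assumption, not the known encoding.
with open(csvfile, 'r', encoding='latin-1') as src, \
     open(newfile, 'w', encoding='latin-1') as outp:
    for line in src:
        outp.write(f(line))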
If you are interested in the function and the patterns in my csv file, voila. Here is a small example of my csv file:
Description           Effectivity                    AvailableLengths   Vendors
Screw 2" length 3"    "machine1, machine2"           25mm               "vend1, ven2"
pin 3"                machine1                       2-3/4"             vend3
pin 25mm              "machine2, machine4"           34mm               "vend5,Vend6"
Filler 2" red         machine5                       "4-1/2", 3""       vend7
"descr1, descr2"      "machin1,machin2,machine3"     50                 "vend1,vend4"
The fields in the csv file are separated by commas, so the first data line is actually stored like this:
Screw 2" length 3","machine1, machine2",25mm,"vend1, ven2"
A csv reader fails on this file because of the multi-value fields and the use of quotation marks for inch dimensions. My function (the function f in the code above) replaces a comma with a semicolon when the comma separates values that belong to the same field, and replaces a quotation mark with 'INCH' when it marks a dimension.
f(firstline)=Screw 2INCH length 3INCH,machine1;machine2,25mm,vend1;ven2
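To make the rules concrete, here is a simplified sketch of what f does for lines like the first one. It is not my exact function; it only implements the rules described above under the assumption that a quote directly after a digit is a dimension mark, and that any other quote wraps a multi-value field:

def f(line):
    # Sketch of the replacement rules described above (not the real f).
    s = line.rstrip('\n')
    out = []
    in_quotes = False                  # inside a quote-wrapped multi-value field
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == '"':
            if in_quotes and (i + 1 == len(s) or s[i + 1] == ','):
                in_quotes = False      # closing quote of a multi-value field: drop it
            elif not in_quotes and (i == 0 or s[i - 1] == ','):
                in_quotes = True       # opening quote of a multi-value field: drop it
            elif i > 0 and s[i - 1].isdigit():
                out.append('INCH')     # quote used as an inch mark after a number
            # any other quote is simply dropped
        elif ch == ',' and in_quotes:
            out.append(';')            # comma inside a multi-value field -> semicolon
            if i + 1 < len(s) and s[i + 1] == ' ':
                i += 1                 # also drop the space that follows such a comma
        else:
            out.append(ch)
        i += 1
    return ''.join(out) + '\n'

With the first line as input, this sketch produces the output shown above.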