I use Python to find patterns in a large csv file (1.2 million lines, 250MB) and perform a modification on each line if such a pattern is found. My approach is like this:

dfile = open(csvfile, 'r')
lines = dfile.readlines()
dfile.close()
for i in range(len(lines)):
    lines[i] = f(lines[i])
# f(.) is a function that modifies the line string if a pattern is found
# then I have code to write the processed data to another csv file

The problem is that after a certain number of iterations, the code stops running with a MemoryError. My system has 32GB of RAM. How can I improve memory performance? I tried to read the data line by line using the following approach:

import linecache  # the standard-library module is linecache, not cache
j = 1
while True:
    line = linecache.getline(csvfile, j)
    if line == '':
        break
    outp = open(newfile, 'a')  # append, so each iteration does not overwrite the file
    outp.write(f(line))
    outp.close()
    j += 1

This approach also failed:

encoding error reading location 0X9b?!

Any solution?

If you are interested in the function and the patterns in my csv file, voilà. This is a small example of my csv file:

Description           Effectivity                AvailableLengths  Vendors
Screw 2" length 3"    "machine1, machine2"       25mm              "vend1, ven2"
pin 3"                machine1                   2-3/4"            vend3
pin 25mm              "machine2, machine4"       34mm              "vend5,Vend6"
Filler 2" red         machine5                   "4-1/2", 3""      vend7
"descr1, descr2"      "machin1,machin2,machine3" 50                "vend1,vend4"

The fields in the csv file are separated with commas, so the first line is like this:

Screw 2" length 3","machine1, machine2",25mm,"vend1, ven2"

A csv reader fails to read this file because of the multi-value fields and the use of quotation marks for dimensions. My function (function f in the code above) replaces a comma with a semicolon if that comma sits between two values belonging to the same field, and replaces a quotation mark with 'INCH' if that quotation mark denotes a dimension.

f(firstline)=Screw 2INCH length 3INCH,machine1;machine2,25mm,vend1;ven2
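For illustration, here is a minimal regex-based sketch of such a function. It reproduces f(firstline) above, but it is only a guess at the real f: it assumes that a quote immediately after a digit is always a dimension, and that any remaining quotes only wrap multi-value fields.

import re

def f(line):
    # Assumption: a double quote right after a digit marks inches.
    line = re.sub(r'(?<=\d)"', 'INCH', line)
    # Remaining quotes should now only wrap multi-value fields:
    # drop them and turn the commas inside into semicolons.
    return re.sub(r'"([^"]*)"',
                  lambda m: re.sub(r'\s*,\s*', ';', m.group(1)),
                  line)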
  • Have you tried using pandas? – Jeril Mar 15 '19 at 04:55
  • What do the file contents look like? And please explain the pattern you're looking for; this will help immensely! Please [edit](https://stackoverflow.com/posts/55175754/edit) your question with these things, not reply in a comment. – Jab Mar 15 '19 at 04:59
  • The problem is that if I split each line by commas, or read the csv file with pandas, the number of fields in each line will be different and inconsistent. So, before reading it as a csv file, I need to perform some modifications on each line. – Hossein Beheshti Fakher Mar 15 '19 at 05:00
  • Exactly why including what kind of actions you're performing, and a sample of the data you're performing them on, is necessary. – Jab Mar 15 '19 at 05:01
  • Also, opening the outfile every iteration is definitely not a great idea. I suggest keeping it open and writing a line when needed, then closing it. – Jab Mar 15 '19 at 05:04
  • @HosseinBeheshtiFakher So from what I'm getting, the CSV is improperly formatted and you are trying to use the above to normalize the data? – josh.chavanne Mar 15 '19 at 05:05
  • I know that the file is not nicely formatted. This code is part of my data cleansing process to make the file ready for further processing. I cannot change the system producing this csv file to have it nicely formatted. I should modify this file before working on it. – Hossein Beheshti Fakher Mar 15 '19 at 05:30
  • @blhsing Please read the question; you'll see it is completely different from the one you mentioned. Remove your "duplicate" mark and let it get answers! – Hossein Beheshti Fakher Mar 15 '19 at 06:04
  • The code in the linked question may look quite different from yours, but the core of the problem is the same: you are trying to read all the rows into a list before processing them. You should, as the answer to the linked question suggests, process each row directly (by calling `f` on the line and writing the returned value to a file) as you iterate through the lines of the file (see the sketch below). – blhsing Mar 15 '19 at 07:28
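A minimal sketch of that streaming approach, assuming f, csvfile, and newfile are defined as in the question:

with open(csvfile, 'r', encoding='utf8') as src, \
     open(newfile, 'w', encoding='utf8') as dst:
    for line in src:        # only one line is held in memory at a time
        dst.write(f(line))  # the output file is opened exactly once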

1 Answer


Try the following for the encoding error:

open(csvfile, 'r', encoding='utf8')
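If utf8 still fails, the file may be in a single-byte encoding: byte 0x9b is not valid on its own in UTF-8, but it is an ordinary character in codecs such as cp1252. The cp1252 guess below is an assumption, not something confirmed from the file:

# cp1252 is only a guess about the file's real encoding
with open(csvfile, 'r', encoding='cp1252') as dfile:
    for line in dfile:
        ...  # process one line at a time

# or keep utf8 and substitute undecodable bytes instead of crashing
with open(csvfile, 'r', encoding='utf8', errors='replace') as dfile:
    for line in dfile:
        ...  # process one line at a time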

As for performance, the problem could be that your function f() has high complexity or memory consumption.

Could you paste the function f() here? You may also consider using regex if you are trying to find a pattern.

Nick Wong
  • Also see my comment about opening the outfile in the loop. It may not impact memory as much as it would performance. – Jab Mar 15 '19 at 05:06
  • Thank you. I thought opening the output file, writing something to it, and closing it right after would release some memory so my code would reach the end of the csv file, but it didn't work! – Hossein Beheshti Fakher Mar 15 '19 at 05:40
  • I do not have access to my f() function right now. I will paste it here soon – Hossein Beheshti Fakher Mar 15 '19 at 05:42
  • Thank you, Nick. Regarding your first comment, I tried several encodings, but the error didn't go away. Also, in Notepad++ I see that the encoding is UTF-8, yet linecache.getline fails to read the line. – Hossein Beheshti Fakher Mar 15 '19 at 05:53
  • Try opening it with notepad++ and see the encoding. – Nick Wong Mar 17 '19 at 03:58
  • Also `pd.read_csv('test.txt', sep = '\s{2,}', quotechar = '"')` might help if it is separated by several (two or more) spaces. Use `sep='\t'` if it is tab-separated. – Nick Wong Mar 17 '19 at 03:59
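For reference, a sketch of that pandas call: a regex separator is handled only by the Python parsing engine, so passing engine='python' explicitly avoids the fallback warning ('test.txt' is a stand-in for the real file).

import pandas as pd

# A regex separator forces the slower Python engine; say so explicitly.
df = pd.read_csv('test.txt', sep=r'\s{2,}', quotechar='"', engine='python')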