I have a 5 GB text file that I am trying to read line by line. Each line has the format: Reviewerid<\t>pid<\t>date<\t>title<\t>body<\n>. This is my code:

o = open('mproducts.txt','w')
with open('reviewsNew.txt','rb') as f1:
    for line in f1:
        line = line.strip()
        line2 = line.split('\t')
        o.write(str(line))
        o.write("\n")

But I get a MemoryError when I run it. I have 8 GB of RAM and 1 TB of disk space, so why am I getting this error? I also tried reading the file in blocks, but I get the same error.

MemoryError 
Kanika Rawat
  • 3
    How long is the longest line in that file? – Francisco Oct 17 '16 at 21:56
  • @FranciscoCouzo I don't know. But when I try to open the file in EmEditor, a pop-up window says "it contains some very large lines. Do you want to open it in binary format." Choosing the binary option displays the file correctly. – Kanika Rawat Oct 17 '16 at 21:59
  • 1
    What is `o` in `o.write()`? If you are keeping everything that you read in memory, I am not surprised that you are getting a memory error. – Akavall Oct 17 '16 at 22:00
  • Mode 'rb' opens the file in binary mode. Try 'r+'. See https://docs.python.org/2/tutorial/inputoutput.html – pscuderi Oct 17 '16 at 22:00
  • Instead of reading line by line, read in fixed-size chunks. That way the line size won't matter. – Paul Rooney Oct 17 '16 at 22:01
  • @PaulRooney How can I read it in chunks? – Kanika Rawat Oct 17 '16 at 22:02
  • You pass a size argument to `read`, e.g. `f1.read(1024)` reads one KB. Then you can just write that to the output file (I presume that's what it is) and read the next chunk. So instead of `for line in f1:`, call `f1.read(1024)` in a loop until you get a read of size 0. – Paul Rooney Oct 17 '16 at 22:05
  • @Akavall I edited my question. `o` is the output file; I write every line I read to it because I need the lines later. – Kanika Rawat Oct 17 '16 at 22:05
  • @PaulRooney I cannot read it like this because my file is in the format Reviewerid<\t>pid<\t>date<\t>title<\t>body<\n>. If I read it in chunks, the last line of each chunk might not be in this format. – Kanika Rawat Oct 17 '16 at 22:09
  • Why don't you add a log line that displays the length of each line and number of the line, then you will see where it breaks. It might be that your whole file is just one big line. – Akavall Oct 17 '16 at 22:13
  • @Akavall How do I add a log line? Sorry, I am new to Python. Also, my file is in the format described above. – Kanika Rawat Oct 17 '16 at 22:14
  • You can just print. For example: `print (len(line))` – Akavall Oct 17 '16 at 22:19
  • If you are on Linux, you can run `wc -l your_file` in the terminal to see how many lines your file has. – Akavall Oct 17 '16 at 22:22
  • @Akavall I know the number of lines is 30,71,800. – Kanika Rawat Oct 17 '16 at 22:23
  • What about the longest line? Maybe [this](http://stackoverflow.com/questions/1655372/longest-line-in-a-file) would help. – Paul Rooney Oct 17 '16 at 22:25
  • @PaulRooney When I print the length, it prints the length of the lines but then gives a memory error. So how will I know at which line it broke? – Kanika Rawat Oct 17 '16 at 22:39
  • 1
    Use `for i, line in enumerate(f1):` and print `i` on each iteration. The last one you see printed should be the last good line. – Paul Rooney Oct 17 '16 at 22:52
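
A note on the diagnostic discussed in the comments above: iterating with `for line in f1` still has to hold each complete line in memory, so it can fail on the very line it is meant to find. Below is a minimal sketch that instead scans fixed-size binary chunks and only counts the bytes between newlines. The function name `longest_line` and the 1 MB default chunk size are illustrative assumptions, not something from the thread.

```python
def longest_line(path, chunk_size=1024 * 1024):
    """Return (line_number, length_in_bytes) of the longest line in `path`.

    The file is read in fixed-size binary chunks, so a very long line is
    never held in memory as a whole. Lengths exclude the trailing b'\\n'.
    """
    longest_no, longest_len = 0, 0   # best line seen so far
    line_no, current_len = 1, 0      # line currently being measured
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                end = chunk.find(b'\n', start)
                if end == -1:
                    # No newline left in this chunk: the line continues.
                    current_len += len(chunk) - start
                    break
                current_len += end - start
                if current_len > longest_len:
                    longest_no, longest_len = line_no, current_len
                line_no += 1
                current_len = 0
                start = end + 1
    # Account for a final line that has no trailing newline.
    if current_len > longest_len:
        longest_no, longest_len = line_no, current_len
    return longest_no, longest_len
```

Calling `longest_line('reviewsNew.txt')` would report which line is the largest and how many bytes it has, which should show whether a single oversized line is the likely cause.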

1 Answer

Update:

Installing 64-bit Python solves the issue.

The OP was using 32-bit Python, which is why the script ran into the memory limit.
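
To check which build you are running, something like the following should work. It is only a sketch: `python_bits` is a helper name made up for this example. It relies on the size of a C pointer as reported by the standard `struct` module. A 32-bit process cannot address more than 4 GB regardless of how much RAM is installed, and the usable amount is often lower in practice.

```python
import struct
import sys


def python_bits():
    """Return the pointer width of the running interpreter: 32 or 64."""
    # "P" is the struct format code for a C void pointer.
    return struct.calcsize("P") * 8


# sys.maxsize is also much smaller on a 32-bit build, so it serves as a cross-check.
print("%d-bit Python, sys.maxsize = %d" % (python_bits(), sys.maxsize))
```

If this reports 32-bit on a 64-bit operating system, installing the 64-bit build is the fix described above.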


Having read all the comments, I think this can help you.

  • You can't read the file in fixed-size chunks (e.g. 1024 bytes), since you need to process complete lines.
  • Instead, read the file in chunks of lines, i.e. N lines at a time.
  • You can use the `yield` keyword and `itertools.islice` in Python to achieve this.

Summary: get N lines at a time, process them, and then write them out.

Sample code:

from itertools import islice

# Yield the file's lines in batches of num_of_lines (change the default as needed).
def get_lines(file_handle, num_of_lines=10):
    while True:
        next_n_lines = list(islice(file_handle, num_of_lines))
        if not next_n_lines:
            break
        yield next_n_lines


# Both files are closed automatically when the with block exits.
with open('reviewsNew.txt', 'r') as f1, open('mproducts.txt', 'w') as o:
    for data_lines in get_lines(f1):
        for line in data_lines:
            line = line.strip()
            line2 = line.split('\t')  # the individual fields, for further processing
            o.write(line)
            o.write("\n")
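
On the concern raised in the comments that fixed-size chunks would cut a record in half: one way around that is to carry the incomplete tail of each chunk over to the next read. The sketch below is an assumption-laden illustration, not part of the original answer. `iter_records` and its 1 MB default are hypothetical names and values, and the file handle is expected to be opened in binary mode. Note that a single enormous line still accumulates in `tail`, so this bounds the read size but not the size of the largest record.

```python
def iter_records(file_handle, chunk_size=1024 * 1024):
    """Yield complete lines (without the newline) from a binary file handle.

    Data is read in fixed-size chunks; the last, possibly incomplete piece
    of each chunk is kept and prepended to the next chunk.
    """
    tail = b''
    while True:
        chunk = file_handle.read(chunk_size)
        if not chunk:
            break
        lines = (tail + chunk).split(b'\n')
        tail = lines.pop()  # may be a partial record; finish it next round
        for line in lines:
            yield line
    if tail:
        # Final record when the file does not end with a newline.
        yield tail
```

Each yielded record can then be split on `b'\t'` into its fields exactly as in the sample code above.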
Community
Dinesh Pundkar