I'm running into a problem that I haven't seen anyone on Stack Overflow encounter, or even ask Google about, for that matter.

My main goal is to be able to replace occurrences of a string in the file with another string. Is there a way to access all of the lines in the file?

The problem is that when I try to read in a large text file (1-2 GB), Python only reads a subset of it.

For example, I'll run a really simple command such as:

newfile = open("newfile.txt","w")
f = open("filename.txt","r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)

And it only writes the first 382 MB of the original file. Has anyone encountered this problem previously?

I tried a few different solutions such as using:

import sys
import fileinput
for i, line in enumerate(fileinput.input("filename.txt", inplace=1)):
    sys.stdout.write(line.replace("string1", "string2"))

But it has the same effect. Nor does reading the file in chunks help, for example using:

f.read(10000)

I've narrowed it down to most likely being a reading problem and not a writing problem, because it happens even when simply printing out lines. I know that there are more lines: when I open the file in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that Python prints.

Can anyone offer any advice or things to try?

I'm currently using a 32-bit version of Windows XP with 3.25 GB of RAM, running Python 2.7.

user1297872
  • Reading line by line with an iterator should be a lazy operation, so it should work regardless of the size of the file. While it shouldn't affect your situation, you will also want to use ``with`` when opening files - it's a good practice that handles closing under exceptions correctly. – Gareth Latty Mar 28 '12 at 10:49
  • That worked great! Thanks so much. *edit: I tried posting the iterator code here, but it wouldn't format, so I added it to the original post. – user1297872 Mar 28 '12 at 11:08
  • Have you tried it with a different large text file? Is there something strange with the file 382mb in - some strange character that is being treated as the end of file? – neil Mar 28 '12 at 11:13
  • I have. I thought it might have been the file at first, but I tried it with files of varying size from various sources. The largest I tried was 2.6 GB and the smallest was 560 MB, but they all stop at 382 MB. – user1297872 Mar 28 '12 at 11:22
  • There's no reason your original code shouldn't have worked. It's also "lazy" as @Latty calls it. You shouldn't need to write your own iterator, or to read in chunks. – alexis Mar 28 '12 at 11:28
  • Related question: [Line reading chokes on 0x1A](http://stackoverflow.com/q/405058/222914) – Janne Karila Mar 28 '12 at 12:43
  • I'd like to note that when I said iterator, that wasn't what I meant - I meant one as in your original example (``for line in f``). So, uh, no problem I guess, but I think the right answer here is codeape's. – Gareth Latty Mar 28 '12 at 17:56

4 Answers

Try:

f = open("filename.txt", "rb")

On Windows, rb means open the file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) opening files in text mode on Windows also treats a Ctrl-Z byte (hex 1A) as end-of-file.

You can also specify the mode when using fileinput:

fileinput.input("filename.txt", inplace=1, mode="rb")
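A minimal, self-contained demonstration of the binary-mode point (the temp-file path and its contents are hypothetical stand-ins):

```python
import os
import tempfile

# Write a small file containing an embedded Ctrl-Z byte (hex 1A).
path = os.path.join(tempfile.gettempdir(), "ctrlz_demo.txt")
with open(path, "wb") as f:
    f.write(b"before\x1aafter\n")

# Binary mode returns every byte on any platform. On Windows,
# text mode ("r") would stop reading at the Ctrl-Z byte instead.
with open(path, "rb") as f:
    data = f.read()

print(len(data))  # 13 -- nothing is dropped in binary mode
```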
codeape
  • That also works! I like that solution the most, because how easy it is to change the existing code. – user1297872 Mar 28 '12 at 11:20
  • How is it that "that also works"? This is clearly your problem. What other approach worked as well? Ah, I see in the comments: specifying a byte-length to read, instead of using readline. – jsbueno Mar 28 '12 at 12:27
  • I faced exactly the same problem. It works perfectly! – Tao Chen Dec 03 '15 at 11:36

Are you sure the problem is with reading and not with writing out? Do you close the file that is written to, either explicitly with newfile.close() or by using the with construct?

Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setup too, closing should fix your initial solution.
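
A sketch of the OP's replace loop with both files managed by with (the sample data here is hypothetical), so the output buffer is guaranteed to be flushed and closed:

```python
import os
import tempfile

# Hypothetical stand-ins for the OP's filename.txt / newfile.txt.
tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "filename.txt")
dst_path = os.path.join(tmp, "newfile.txt")
with open(src_path, "w") as f:
    f.write("a string1 here\nanother string1\n")

# 'with' closes (and therefore flushes) both files even if an
# exception is raised mid-loop, so no buffered data is lost.
with open(src_path, "r") as src, open(dst_path, "w") as dst:
    for line in src:
        dst.write(line.replace("string1", "string2"))

with open(dst_path, "r") as f:
    print(f.read())
```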

benroth

If you use the file like this:

with open("filename.txt") as f:
    for line in f:
        newfile.write(line.replace("string1", "string2"))

It should only read one line into memory at a time, unless you keep a reference to that line.
After each line is read, it is up to Python's garbage collector to get rid of it. Give this a try and see if it works for you :)

Serdalis

Found the solution thanks to Gareth Latty. Using a generator:

def read_in_chunks(file, chunk_size=1000):
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

This answer was posted as an edit to the question Python Does Not Read Entire Text File by the OP user1297872 under CC BY-SA 3.0.
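
A minimal, self-contained usage sketch of the generator above (the sample text and chunk size are hypothetical; note that a match straddling a chunk boundary would be missed, a known limitation of chunked replacement):

```python
import io

def read_in_chunks(file, chunk_size=1000):
    # Yield successive fixed-size chunks until the file is exhausted.
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data

# Hypothetical sample: each 8-character chunk lines up exactly with
# one "string1 " occurrence, so no match spans a chunk boundary.
src = io.StringIO("string1 " * 5)
out = "".join(chunk.replace("string1", "string2")
              for chunk in read_in_chunks(src, chunk_size=8))
print(out)  # "string2 " repeated 5 times
```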

vvvvv