
I am using the command `lineslist = file.readlines()` on a 2GB file.

So, I guess it will create a `lineslist` list of 2GB or more in size. Is that basically the same as `readfile = file.read()`, which also creates a `readfile` (instance/variable?) of exactly 2GB?

Why should I prefer readlines in this case?
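To illustrate what I mean (the file name here is just a placeholder):

with open('bigfile.bin') as f:
    readfile = f.read()        # one long str holding the entire ~2GB of data

with open('bigfile.bin') as f:
    lineslist = f.readlines()  # a list of str, which together also hold the entire file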

Adding to that, I have one more question. The following is mentioned at https://docs.python.org/2/tutorial/inputoutput.html:

readline(): a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous;

I don't understand the last point. So, does readlines() also give an unambiguous value in the last element of its list if there is no \n at the end of the file?
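As far as I can tell from the docs, the point is that readline() returns an empty string only at end of file, because a blank line in the file still comes back as '\n'. A rough sketch of what I mean (the file name is made up):

with open('somefile.txt') as f:
    while True:
        line = f.readline()
        if line == '':     # an empty string can only mean end of file
            break
        if line == '\n':   # a genuinely blank line in the file is still '\n'
            continue
        # otherwise 'line' keeps its trailing '\n', except possibly on the very
        # last line if the file does not end with a newline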

We are dealing with combining files (which were split on the basis of block size), so I am thinking of choosing readlines or read. The individual files may not end with a \n after splitting, and if readlines returns unambiguous values, it would be a problem, I think.

PS: I haven't learnt Python, so forgive me if there is no such thing as instances in Python or if I am talking rubbish. I am just assuming.

EDIT:

OK, I just found out. It's not returning any unambiguous output.

>>> len(lineslist)
6923798
>>> lineslist[6923797]
"\xf4\xe5\xcf1)\xff\x16\x93\xf2\xa3-\....\xab\xbb\xcd"

So, it doesn't end with '\n', but it's not unambiguous output either.

Also, there is no unambiguous output with readline either for the last line.

GP92
  • Yes, it takes RAM to hold the result of both `readlines` and `read` (both read the whole file into memory). Use `for line in file` if memory is an issue. [This](http://stupidpythonideas.blogspot.nl/2013/06/readlines-considered-silly.html) page discourages the use of readlines. [This](http://stackoverflow.com/questions/17246260/python-readlines-usage-and-efficient-practice-for-reading) also looks at something similar. – M.T Apr 07 '16 at 07:46
  • @M.T Hi, thanks. This is exactly what I am looking for. These links are helpful. – GP92 Apr 07 '16 at 08:11

3 Answers


file.read() will read the entire stream of data as one long string, whereas file.readlines() will create a list of lines from the stream.
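Roughly, assuming a small text file named data.txt (the name is only illustrative):

with open('data.txt') as f:
    whole = f.read()       # one string: 'first line\nsecond line\n...'

with open('data.txt') as f:
    lines = f.readlines()  # list of strings: ['first line\n', 'second line\n', ...]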

Generally performance will suffer, especially in the case of large files, if you read in the entire thing all at once. The general approach is to iterate over the file object line by line, which it supports.

for line in file_object:
    # Process the line

This way of processing only consumes memory for one line at a time (loosely speaking), not the entire contents of the file.

Christian Witts
  • Hi, but it does need memory to hold the whole array instance, right? I am not able to figure it out correctly – GP92 Apr 07 '16 at 07:54
  • In this case, I feel that using readline in a for loop is preferable to readlines, though it is slower. Is it? – GP92 Apr 07 '16 at 07:56

If I understood your issue correctly, you just want to combine (i.e. concatenate) files.

If memory is an issue, normally `for line in f` is the way to go.

I tried benchmarking using a 1.9GB CSV file. One possible alternative is to read in large chunks of data which fit in memory.

Codes:

#read in large chunks - fastest in my test
chunksize = 2**16
with open(fn,'r') as f:
    chunk = f.read(chunksize)
    while chunk:
        chunk = f.read(chunksize)
#1 loop, best of 3: 4.48 s per loop

#read whole file in one go - slowest in my test
with open(fn,'r') as f:
    chunk = f.read()
#1 loop, best of 3: 11.7 s per loop

#read file using iterator over each line - most practical for most cases
with open(fn,'r') as f:
    for line in f:
        s = line
#1 loop, best of 3: 6.74 s per loop

Knowing this you could write something like:

with open(outputfile,'w') as fo:
    for inputfile in inputfiles: #assuming inputfiles is a list of filepaths
        with open(inputfile,'r') as fi:
            for chunk in iter(lambda: fi.read(chunksize), ''):
                fo.write(chunk)
            fo.write('\n') #newline between each file (might not be necessary)
M.T
  • Hi, actually, there is no issue with memory. But using read with a chunksize instead of readlines seems to be the proper solution if file parsing is not required – GP92 Apr 07 '16 at 08:33

Yes, readlines() causes the whole file to be read into a variable. It is much better to read the file line by line:

f = open("file_path", "r")
for line in f:
    print line

This loads only one line into RAM at a time, so you're saving about 1.99 GB of memory :)

As I understood, you want to concatenate two files:

target = open("target_file", "w")
f1 = open("f1", "r")
f2 = open("f2", "r")
for line in f1:
    target.write(line)
for line in f2:
    target.write(line)
f1.close()
f2.close()
target.close()

Or consider using other technology like bash:

cat file1 > target
cat file2 >> target

Lukasz Deptula