6

I have three huge files, each with just two columns, and I need both columns. I want to merge them into one file which I can then write to an SQLite database.

I used Python and got the job done, but it took more than 30 minutes and hung my system for 10 of those minutes. I was wondering if there is a faster way using awk or some other Unix tool. A faster way within Python would be great too. My code is below:

'''We have three months of tweets in 3 different files.
Combine them into a single file.'''
import sys
data1 = open(sys.argv[1], 'r')
data2 = open(sys.argv[2], 'r')
data3 = open(sys.argv[3], 'r')
data4 = open(sys.argv[4], 'w')
for line in data1:
    data4.write(line)
data1.close()
for line in data2:
    data4.write(line)
data2.close()
for line in data3:
    data4.write(line)
data3.close()
data4.close()
Martin Geisler
crazyaboutliv
  • Why not load the files into SQLite one at a time? – rjmunro Jan 09 '12 at 13:51
  • I thought this would take equal, if not longer, time. Longer because the second time I do it, it has to append to the existing data, so I assumed it would be equally expensive, if not more. – crazyaboutliv Jan 09 '12 at 13:54
  • Just a comment. Your code would be easier to read if you had named the file objects input1, input2, input3 and output1. – Stuart Woodward Jan 10 '12 at 00:07
  • If you are going for maximum performance, also take a look at the I/O performance vs the CPU performance. I.e., did you reach 100% CPU usage or 100% disk I/O usage during the run? If you say that the machine hung, I guess you hit 100% disk I/O on the same disk that the OS is writing to. Try reading and writing to a disk physically separate from the one that the OS is using. Also, a SAN disk is faster than a typical physical disk; it's designed for high I/O performance. – Stuart Woodward Jan 10 '12 at 00:13

3 Answers

13

The standard Unix way to merge files is cat. It may not be much faster than your script, but it will be faster.

cat file1 file2 file3 > bigfile

Rather than making a temporary file, you may be able to pipe cat straight into sqlite:

cat file1 file2 file3 | sqlite database
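
If loading into SQLite is the end goal, another option is to skip the merged file entirely and insert each input directly; below is a minimal sketch using Python's built-in sqlite3 module. The tab separator, the column types, and the tweets table name are assumptions for illustration, not details taken from the question.

import csv
import sqlite3
import sys

# Assumed schema and separator; adjust to the real data.
conn = sqlite3.connect('tweets.db')
conn.execute('CREATE TABLE IF NOT EXISTS tweets (col1 TEXT, col2 TEXT)')
for name in sys.argv[1:4]:
    with open(name, newline='') as f:
        rows = csv.reader(f, delimiter='\t')
        conn.executemany('INSERT INTO tweets VALUES (?, ?)', rows)
conn.commit()
conn.close()

Wrapping all the inserts in a single transaction (one commit at the end, as above) is what keeps this reasonably fast; committing after every row would be far slower.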

In Python, you will probably get better performance if you copy the file in blocks rather than lines. Use file.read(65536) to read 64 KB of data at a time, rather than iterating through the files with a for loop.
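
A minimal sketch of that block-copy idea, keeping the original script's argument order and using shutil.copyfileobj, which performs the same read/write loop internally:

import shutil
import sys

# Copy each input file into the output in 64 KB blocks
# instead of line by line.
with open(sys.argv[4], 'wb') as out:
    for name in sys.argv[1:4]:
        with open(name, 'rb') as src:
            shutil.copyfileobj(src, out, 65536)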

rjmunro
  • Thanks. I will give cat a try :) Sad I could not think of cat earlier though :( – crazyaboutliv Jan 09 '12 at 13:54
  • I'd be really interested to see the definitive answer on most efficient block size. At some point we are trying to second guess the OS. – Stuart Woodward Jan 10 '12 at 00:04
  • 1
    @StuartWoodward You want a value that is bigger to copy more at once, but not big enough to waste too much of your system's RAM. You also want it to be a multiple of the underlying filesystem's block size so that you always finish on a block boundary. 65536 is a multiple of most plausible block sizes as block sizes are always a power of 2, often 4096 or 8192, and is big enough to get several blocks at a time, but not big enough to use a significant fraction of a modern machine's memory. If you are running this every day, it may be worth trying larger blocks, but 64k is a good start. – rjmunro Nov 14 '12 at 15:49
2

On UNIX-like systems:

cat file1 file2 file3 > file4
Sjoerd
1

I'm assuming that you need to repeat this process and that speed is a critical factor.

Try opening the files as binary files and experiment with the size of the block that you are reading. Try 4096 and 8192 bytes as these are common underlying buffer sizes.
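
For example, a small harness along these lines could compare a few block sizes directly; this is a sketch only, reusing the original script's argument order, and the timings will of course depend on your disks.

import sys
import time

def copy_in_blocks(sources, dest, block_size):
    # Binary-mode copy, reading block_size bytes at a time.
    with open(dest, 'wb') as out:
        for name in sources:
            with open(name, 'rb') as src:
                while True:
                    chunk = src.read(block_size)
                    if not chunk:
                        break
                    out.write(chunk)

# Compare a few block sizes and print how long each full copy took.
for size in (4096, 8192, 65536):
    start = time.time()
    copy_in_blocks(sys.argv[1:4], sys.argv[4], size)
    print(size, time.time() - start)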

There is a similar question, Is it possible to speed-up python IO?, that might be of interest too.

Stuart Woodward