6

I have three huge files, each with just two columns, and I need both columns. I want to merge them into one file which I can then write to an SQLite database.

I used Python and got the job done, but it took more than 30 minutes and hung my system for 10 of those minutes. I was wondering if there is a faster way using awk or some other Unix tool. A faster way within Python would be great too. My code is below:

'''We have three months of tweets in 3 different files.
Combine them into a single file.'''
import sys
data1 = open(sys.argv[1], 'r')
data2 = open(sys.argv[2], 'r')
data3 = open(sys.argv[3], 'r')
data4 = open(sys.argv[4], 'w')
for line in data1:
    data4.write(line)
data1.close()
for line in data2:
    data4.write(line)
data2.close()
for line in data3:
    data4.write(line)
data3.close()
data4.close()
Martin Geisler
crazyaboutliv
  • Why not load the files into SQLite one at a time? – rjmunro Jan 09 '12 at 13:51
  • I thought this would take equal, if not longer, time. Longer because the second time I do it, it has to append to the existing data, so I assumed it would be equally expensive, if not more. – crazyaboutliv Jan 09 '12 at 13:54
  • Just a comment. Your code would be easier to read if you had named the file objects input1, input2, input3 and output1. – Stuart Woodward Jan 10 '12 at 00:07
  • If you are going for maximum performance, also take a look at the I/O performance vs the CPU performance. I.e., did you reach 100% CPU usage or 100% disk I/O usage during the run? If you say that the machine hung, I guess you hit 100% disk I/O on the same disk that the OS is writing to. Try reading and writing to a disk physically separate from the one that the OS is using. Also, a SAN disk is faster than a typical physical disk; it's designed for high I/O performance. – Stuart Woodward Jan 10 '12 at 00:13

3 Answers

13

The standard Unix way to merge files is cat. It may not be much faster than your script, but it will be faster.

cat file1 file2 file3 > bigfile

Rather than making a temporary file, you may be able to pipe cat straight into sqlite:

cat file1 file2 file3 | sqlite database
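
If loading into SQLite is the end goal, another option is to skip the merged file entirely and insert each input directly; below is a minimal sketch using Python's built-in sqlite3 module. The tab separator, the column types, and the tweets table name are assumptions for illustration, not details taken from the question.

import csv
import sqlite3
import sys

# Assumed schema and separator; adjust to the real data.
conn = sqlite3.connect('tweets.db')
conn.execute('CREATE TABLE IF NOT EXISTS tweets (col1 TEXT, col2 TEXT)')
for name in sys.argv[1:4]:
    with open(name, newline='') as f:
        rows = csv.reader(f, delimiter='\t')
        conn.executemany('INSERT INTO tweets VALUES (?, ?)', rows)
conn.commit()
conn.close()

Wrapping all the inserts in a single transaction (one commit at the end, as above) is what keeps this reasonably fast; committing after every row would be far slower.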

In Python, you will probably get better performance if you copy the file in blocks rather than lines. Use file.read(65536) to read 64 KB of data at a time, rather than iterating through the files with a for loop.
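
A minimal sketch of that block-copy idea, keeping the original script's argument order and using shutil.copyfileobj, which performs the same read/write loop internally:

import shutil
import sys

# Copy each input file into the output in 64 KB blocks
# instead of line by line.
with open(sys.argv[4], 'wb') as out:
    for name in sys.argv[1:4]:
        with open(name, 'rb') as src:
            shutil.copyfileobj(src, out, 65536)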

rjmunro
  • Thanks. I will give cat a try :) Sad I could not think of cat earlier though :( – crazyaboutliv Jan 09 '12 at 13:54
  • I'd be really interested to see the definitive answer on most efficient block size. At some point we are trying to second guess the OS. – Stuart Woodward Jan 10 '12 at 00:04
  • 1
    @StuartWoodward You want a value that is bigger to copy more at once, but not big enough to waste too much of your system's RAM. You also want it to be a multiple of the underlying filesystem's block size so that you always finish on a block boundary. 65536 is a multiple of most plausible block sizes as block sizes are always a power of 2, often 4096 or 8192, and is big enough to get several blocks at a time, but not big enough to use a significant fraction of a modern machine's memory. If you are running this every day, it may be worth trying larger blocks, but 64k is a good start. – rjmunro Nov 14 '12 at 15:49
2

On UNIX-like systems:

cat file1 file2 file3 > file4
Sjoerd
1

I'm assuming that you need to repeat this process and that speed is a critical factor.

Try opening the files as binary files and experiment with the size of the block that you are reading. Try 4096 and 8192 bytes as these are common underlying buffer sizes.
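
For example, a small harness along these lines could compare a few block sizes directly; this is a sketch only, reusing the original script's argument order, and the timings will of course depend on your disks.

import sys
import time

def copy_in_blocks(sources, dest, block_size):
    # Binary-mode copy, reading block_size bytes at a time.
    with open(dest, 'wb') as out:
        for name in sources:
            with open(name, 'rb') as src:
                while True:
                    chunk = src.read(block_size)
                    if not chunk:
                        break
                    out.write(chunk)

# Compare a few block sizes and print how long each full copy took.
for size in (4096, 8192, 65536):
    start = time.time()
    copy_in_blocks(sys.argv[1:4], sys.argv[4], size)
    print(size, time.time() - start)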

There is a similar question, Is it possible to speed-up python IO?, that might be of interest too.

Stuart Woodward