230

I have a list of 20 file names, like ['file1.txt', 'file2.txt', ...]. I want to write a Python script to concatenate these files into a new file. I could open each file with f = open(...), read it line by line by calling f.readline(), and write each line into that new file. It doesn't seem very "elegant" to me, especially the part where I have to read/write line by line.

Is there a more "elegant" way to do this in Python?

SuperStormer
  • 4,997
  • 5
  • 25
  • 35
JJ Beck
  • 5,193
  • 7
  • 32
  • 36
  • 9
    It's not Python, but in shell scripting you could do something like `cat file1.txt file2.txt file3.txt ... > output.txt`. In Python, if you don't like `readline()`, there is always `readlines()` or simply `read()`. – jedwards Nov 28 '12 at 19:57
  • 1
    @jedwards simply run the `cat file1.txt file2.txt file3.txt` command using `subprocess` module and you're done. But I am not sure if `cat` works in windows. – Ashwini Chaudhary Nov 28 '12 at 19:59
  • 7
    As a note, the way you describe is a terrible way to read a file. Use the ``with`` statement to ensure your files are closed properly, and iterate over the file to get lines, rather than using ``f.readline()``. – Gareth Latty Nov 28 '12 at 20:04
  • @jedwards cat doesn't work when the text file is unicode. – Avi Cohen Aug 08 '13 at 12:11
  • Actual analysis https://waymoot.org/home/python_string/ – nu everest Feb 09 '16 at 20:40

12 Answers

319

This should do it

For large files:

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

For small files:

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())

… and another interesting one that I thought of:

import itertools

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for line in itertools.chain.from_iterable(itertools.imap(open, filenames)):
        outfile.write(line)

Sadly, this last method leaves a few open file descriptors, which the GC should take care of anyway. I just thought it was interesting.
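
A Python 3 sketch of how that descriptor leak could be avoided while keeping the chained-iterator style, using contextlib.ExitStack (the file names and output path are placeholders of mine, not from the answer above):

import itertools
from contextlib import ExitStack

filenames = ['file1.txt', 'file2.txt']  # and so on

with ExitStack() as stack, open('combined.txt', 'w') as outfile:
    # every file opened through the stack is closed when the block exits
    infiles = (stack.enter_context(open(fname)) for fname in filenames)
    for line in itertools.chain.from_iterable(infiles):
        outfile.write(line)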

inspectorG4dget
  • 110,290
  • 27
  • 149
  • 241
  • 14
    This will, for large files, be very memory inefficient. – Gareth Latty Nov 28 '12 at 20:06
  • @inspectorG4dget I don't see this code as very time efficient for files that aren't large and can be read entirely at once. In my opinion, it's impossible to write code that is as efficient for big files as for files that aren't big – eyquem Nov 28 '12 at 20:36
  • @eyquem: Have you actually done performance tests or profiling on any of these solutions, or are you just guessing what's going to be fast based on your intuitions of how computers work? – abarnert Nov 28 '12 at 20:57
  • 1
    @inspectorG4dget: I wasn't asking you, I was asking eyquem, who complained that your solution wasn't going to be efficient. I'm willing to bet it's more than efficient enough for the OP's use case, and for whatever use case eyquem has in mind. If he thinks it isn't, it's his responsibility to prove that before demanding that you optimize it. – abarnert Nov 28 '12 at 21:16
  • @abarnert I haven't tested running times of programs recently, but that's not intuition. Some months ago I was fond of testing run times for plenty of programs and I keep in mind the results. Though, there are sometimes surprises in the results, and you are right, the best would be to do the tests. I haven't enough motivation for that presently. – eyquem Nov 28 '12 at 21:59
  • You could just read a certain size until the end: f.read(4096) # or whatever size – Alex Huszagh May 21 '15 at 09:43
  • 2
    what are we considering a _large_ file to be? – Dee Aug 02 '15 at 22:55
  • 5
    @dee: a file so large that its contents don't fit into main memory – inspectorG4dget Aug 02 '15 at 22:59
  • 3
    why would you decode and re-encode the whole thing? and search for newlines and all the unnecessary stuff when all that’s required is concatenating the files. the `shutil.copyfileobj` answer below will be much faster. – flying sheep Aug 19 '15 at 10:12
  • 1
    I actually did some brief profiling for small files (i.e. each file smaller than the 32 GB of system memory). The results are: 1) The method for large files is fastest. 2) The method for small files is slower by 20%. 3) Using shutil (Meow's answer) is slower by 150%. 4) Using fileinput did not work. Usual disclaimers apply. Ciao – astabada Apr 23 '16 at 06:12
  • @inspectorG4dget how do I concatenate only the first 130 lines of all the files in filenames? – Gurminder Bharani Aug 23 '16 at 12:16
  • 1
    @GurminderBharani: replace `for line in infile` with `for _, line in zip(range(130), infile)` – inspectorG4dget Aug 25 '16 at 00:48
  • @astabada I did a "large" text profiling (each file was 5MB). You can reproduce my results by copying and pasting their code and using this as your [text](http://norvig.com/big.txt). Using `filenames = ['big.txt',]*1000` I got that this answer's large-file result averaged ~130s over three runs, while Meow's answer averaged ~60s over three. Meow's solution seemed to be a clear winner for large files. – Novice C Sep 23 '16 at 03:55
  • For reference, abarnert's solution (modified to work in Python 2.7) took ~300s. – Novice C Sep 23 '16 at 04:10
  • 14
    Just to reiterate: this is the wrong answer, shutil.copyfileobj is the right answer. – Paul Crowley Apr 05 '17 at 17:05
  • 1
    Be careful if for the list `filenames` you use something like `os.listdir(...)`: the result is not sorted in alphanumeric order. Use `sorted(os.listdir(...))` – Antoni Jul 17 '17 at 23:05
  • @inspectorG4dget when concatenating files I noticed that sometimes the last line from one file gets merged with the first line from the next file and printed on the same line. Is it possible to detect such situations in Python? I guess this happens when a newline is missing. A check of the form `if '\n' not in line: print("line"+"\n")` somehow solved the problem, but is it possible to detect this situation automatically in Python? – Alexander Cska Mar 18 '19 at 11:17
  • @AlexanderCska: yes, that's possible. It can be done with `if not line.endswith('\n'): outfile.write('\n')`. But that would require you to add the check for each file-open – inspectorG4dget Apr 25 '19 at 16:50
  • Is there a way to use this code without writing all the file names in filenames, i.e. just load all the files of a data folder? –  Oct 31 '22 at 12:13
  • @J_Martin: `filenames = glob.glob(os.path.join('datafolder', "*"))` – inspectorG4dget Nov 01 '22 at 13:22
274

Use shutil.copyfileobj.

It automatically reads the input files chunk by chunk for you, which is more efficient, and it will work even if some of the input files are too large to fit into memory:

import shutil

with open('output_file.txt','wb') as wfd:
    for f in ['seg1.txt','seg2.txt','seg3.txt']:
        with open(f,'rb') as fd:
            shutil.copyfileobj(fd, wfd)
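
As one of the comments below notes, this is a raw byte-for-byte concatenation; if an input file has no trailing newline, its last line runs into the next file's first line. A minimal sketch of a variant that writes a separator after each file (the b'\n' is my addition, following that comment, not part of the original answer):

import shutil

filenames = ['seg1.txt', 'seg2.txt', 'seg3.txt']

with open('output_file.txt', 'wb') as wfd:
    for f in filenames:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
        # keeps files from running together, at the cost of a possible extra blank line
        wfd.write(b'\n')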
Jeyekomon
  • 2,878
  • 2
  • 27
  • 37
Meow
  • 4,341
  • 1
  • 18
  • 17
  • 5
    `for i in glob.glob(r'c:/Users/Desktop/folder/putty/*.txt'):` well, I replaced the for statement to include all the files in the directory, but my `output_file` started growing really huge, like hundreds of GB, very quickly. – R__raki__ Oct 05 '16 at 08:32
  • 26
    Note that this will merge the last line of each file with the first line of the next file if there are no EOL characters. In my case I got a totally corrupted result after using this code. I added wfd.write(b"\n") after copyfileobj to get a normal result – Thelambofgoat Feb 18 '19 at 11:25
  • 8
    @Thelambofgoat I would say that is not a pure concatenation in that case, but hey, whatever suits your needs. – HelloGoodbye Oct 18 '19 at 08:31
  • 1
    This is by far the best answer! – Kai Petzke Aug 14 '20 at 18:30
  • 1
    This is super fast and just what I required. Yes, it does not add a new line between one file's end and the next file's start, which is exactly what I needed, so don't update it :D – Adnan Ali Feb 22 '21 at 09:11
66

That's exactly what fileinput is for:

import fileinput
with open(outfilename, 'w') as fout, fileinput.input(filenames) as fin:
    for line in fin:
        fout.write(line)

For this use case, it's really not much simpler than just iterating over the files manually, but in other cases, having a single iterator that iterates over all of the files as if they were a single file is very handy. (Also, the fact that fileinput closes each file as soon as it's done means there's no need to with or close each one, but that's just a one-line savings, not that big of a deal.)

There are some other nifty features in fileinput, like the ability to do in-place modifications of files just by filtering each line.
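
A minimal sketch of that in-place feature, separate from the concatenation task (the file names and the uppercasing filter are made-up examples of mine):

import fileinput

# Rewrites each listed file in place; whatever is printed replaces the original
# lines, and a .bak copy of each file is kept because of the backup argument.
with fileinput.input(['file1.txt', 'file2.txt'], inplace=True, backup='.bak') as fin:
    for line in fin:
        print(line.rstrip('\n').upper())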


As noted in the comments, and discussed in another post, fileinput for Python 2.7 will not work as indicated. Here is a slight modification to make the code Python 2.7 compliant:

with open('outfilename', 'w') as fout:
    fin = fileinput.input(filenames)
    for line in fin:
        fout.write(line)
    fin.close()
Novice C
  • 1,344
  • 2
  • 15
  • 27
abarnert
  • 354,177
  • 51
  • 601
  • 671
  • @Lattyware: I think most people who learn about `fileinput` are told that it's a way to turn a simple `sys.argv` (or what's left as args after `optparse`/etc.) into a big virtual file for trivial scripts, and don't think to use it for anything else (i.e., when the list isn't command-line args). Or they do learn, but then forget—I keep re-discovering it every year or two… – abarnert Nov 28 '12 at 20:24
  • 1
    @abarnert I think `for line in fileinput.input()` isn't the best choice in this particular case: the OP wants to concatenate files, not read them line by line, which is in theory a longer process to execute – eyquem Nov 28 '12 at 20:30
  • 1
    @eyquem: It's not a longer process to execute. As you yourself pointed out, line-based solutions don't read one character at a time; they read in chunks and pull lines out of a buffer. The I/O time will completely swamp the line-parsing time, so as long as the implementor didn't do something horribly stupid in the buffering, it will be just as fast (and possibly even faster than trying to guess at a good buffer size yourself, if you think 10000 is a good choice). – abarnert Nov 28 '12 at 20:46
  • 1
    @abarnert NO, 10000 isn't a good choice. It is indeed a very bad choice because it isn't a power of 2 and it is a ridiculously small size. Better sizes would be 2097152 (2**21), 16777216 (2**24) or even 134217728 (2**27); why not? 128 MB is nothing in a RAM of 4 GB. – eyquem Nov 28 '12 at 21:55
  • Huge buffers really don't help much. In fact, if you're reading more than your OS's typical readahead cache size, you'll end up waiting around for data when you could be writing. Plus, run a dozen apps that all think 128MB is nothing, and suddenly your system is thrashing swap and slowing to a crawl. It really is very easy to test this stuff, so try it and see. – abarnert Nov 28 '12 at 22:08
  • @abarnert Yes yes yes, but a learned guy who understands what he does won't trigger such a program while running 2**4 other applications. - You're right, I should test - And oh, I understand from your use of the word 'readahead' that you are a Linux user, aren't you? That's why you know more about the innards than most, I guess – eyquem Nov 28 '12 at 22:34
  • @eyquem: Actually, I'm on a Mac. I'm currently running 147 processes, most of which are using at least 64MB of VM. In fact, it's very hard not to be running a whole lot more than 2**4 processes on any modern Windows, Mac, Linux system… or, for that matter, iOS or Android phone. – abarnert Nov 28 '12 at 22:49
  • @abarnert Gargl. There are presently 33 processes on my computer, only 7 of them using more than 20,000 KB.... – eyquem Nov 28 '12 at 23:12
  • You probably don't have the "show all processes" (or whatever it's called in current Windows) enabled. But anyway, a learned guy like you or me is always running well over 2**4 other applications when he triggers _anything_. – abarnert Nov 29 '12 at 00:18
  • 2
    Example code not quite valid for Python 2.7.10 and later: http://stackoverflow.com/questions/30835090/attributeerror-fileinput-instance-has-no-attribute-exit – CnrL Sep 25 '15 at 13:43
8
outfile.write(infile.read()) # time: 2.1085190773010254s
shutil.copyfileobj(fd, wfd, 1024*1024*10) # time: 0.60599684715271s

A simple benchmark shows that shutil performs better.
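
The answer doesn't include the benchmark code itself; a minimal sketch of the kind of harness that could produce comparable numbers (the file names are placeholders, and the buffer size mirrors the line above):

import shutil
import time

filenames = ['file1.txt', 'file2.txt', 'file3.txt']  # hypothetical inputs

def concat_read(out):
    with open(out, 'wb') as outfile:
        for name in filenames:
            with open(name, 'rb') as infile:
                outfile.write(infile.read())

def concat_shutil(out):
    with open(out, 'wb') as wfd:
        for name in filenames:
            with open(name, 'rb') as fd:
                shutil.copyfileobj(fd, wfd, 1024 * 1024 * 10)

for func in (concat_read, concat_shutil):
    start = time.perf_counter()
    func('out_' + func.__name__ + '.txt')
    print(func.__name__, time.perf_counter() - start)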

Clint Chelak
  • 232
  • 2
  • 9
haoming
  • 787
  • 7
  • 4
7

I don't know about elegance, but this works:

import glob
import os

for f in glob.glob("file*.txt"):
    os.system("cat " + f + " >> OutFile.txt")
Daniel
  • 143
  • 1
  • 1
  • 9
    you can even avoid the loop: import os; os.system("cat file*.txt >> OutFile.txt") – lib Feb 13 '15 at 14:36
  • 15
    not cross-platform and will break for file names with spaces in them – flying sheep Aug 19 '15 at 10:09
  • 6
    This is insecure; also, `cat` can take a list of files, so no need to repeatedly call it. You can easily make it safe by calling `subprocess.check_call` instead of `os.system` – Clément Nov 10 '17 at 01:22
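
A minimal sketch of the safer variant suggested in the comment above, passing the whole file list to a single cat call through subprocess (Unix-like systems only; the file pattern and output name are taken from the answer, sorted() is my addition for deterministic order):

import glob
import subprocess

files = sorted(glob.glob("file*.txt"))

# One cat invocation with an argument list: no shell involved, so spaces in names are safe.
with open("OutFile.txt", "wb") as out:
    subprocess.check_call(["cat", *files], stdout=out)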
6

What's wrong with UNIX commands? (Given you're not working on Windows.)

`ls | xargs cat | tee output.txt` does the job (you can call it from Python with subprocess if you want).

lucasg
  • 10,734
  • 4
  • 35
  • 57
6

If you have a lot of files in the directory, then glob2 might be a better option for generating the list of filenames rather than writing them by hand.

import glob2

filenames = glob2.glob('*.txt')  # list of all .txt files in the directory

with open('outfile.txt', 'w') as f:
    for file in filenames:
        with open(file) as infile:
            f.write(infile.read()+'\n')
Michael H.
  • 3,323
  • 2
  • 23
  • 31
Sharad
  • 69
  • 1
  • 1
  • 1
    What does this have to do with the question? Why use `glob2` instead of the `glob` module, or the globbing functionality in `pathlib`? – AMC Jul 22 '20 at 19:10
  • Very good and complete Python code. Works brilliantly. Thanks. – Just Me Aug 28 '22 at 06:45
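
As the first comment above asks, the standard library can do the same job without glob2; a minimal sketch using pathlib (the sorted order and the exclusion of the output file are my own choices):

from pathlib import Path

out_path = Path('outfile.txt')
# all .txt files in the current directory, skipping the output file if it already exists
filenames = sorted(p for p in Path('.').glob('*.txt') if p != out_path)

with out_path.open('w') as f:
    for file in filenames:
        f.write(file.read_text() + '\n')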
3

An alternative to @inspectorG4dget's answer (the best answer to date, 29-03-2016). I tested with 3 files of 436 MB.

@inspectorG4dget's solution: 162 seconds

The following solution: 125 seconds

from subprocess import Popen

filenames = ['file1.txt', 'file2.txt', 'file3.txt']
fbatch = open('batch.bat', 'w')
cmd = "type "
for f in filenames:
    cmd += f + " "
fbatch.write(cmd + " > file4results.txt")
fbatch.close()
p = Popen("batch.bat", cwd=r"Drive:\Path\to\folder")
stdout, stderr = p.communicate()

The idea is to create a batch file and execute it, taking advantage of "good old technology". It's semi-Python, but works faster. Works on Windows.
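
A variant of the same idea that skips the intermediate batch file and calls cmd's type command directly (Windows only; this is my own adaptation, not the answer's code):

import subprocess

filenames = ['file1.txt', 'file2.txt', 'file3.txt']

# cmd's built-in "type" concatenates its arguments to stdout; redirect that into the output file.
with open('file4results.txt', 'wb') as out:
    subprocess.check_call(['cmd', '/c', 'type'] + filenames, stdout=out)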

João Palma
  • 169
  • 1
  • 1
  • 9
2

Check out the .read() method of the File object:

http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects

You could do something like:

concat = ""
for file in files:
    concat += open(file).read()

or a more 'elegant' Python way:

concat = ''.join([open(f).read() for f in files])

which, according to this article (http://www.skymind.com/~ocrow/python_string/), would also be the fastest.

Alex Kawrykow
  • 95
  • 1
  • 1
  • 3
  • 12
    This will produce a giant string, which, depending on the size of the files, could be larger than the available memory. As Python provides easy lazy access to files, it's a bad idea. – Gareth Latty Nov 28 '12 at 20:05
2

If the files are not gigantic:

with open('newfile.txt','wb') as newf:
    for filename in list_of_files:
        with open(filename,'rb') as hf:
            newf.write(hf.read())
            # newf.write('\n\n\n')   if you want to introduce
            # some blank lines between the contents of the copied files

If the files are too big to be entirely read and held in RAM, the algorithm must be a little different: read each file to be copied in a loop, in chunks of fixed length, using read(10000) for example.
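
A minimal sketch of that chunked variant (the list of inputs and the 64 KB chunk size are placeholder choices of mine):

list_of_files = ['file1.txt', 'file2.txt']  # hypothetical inputs
CHUNK_SIZE = 64 * 1024                      # read this many bytes at a time

with open('newfile.txt', 'wb') as newf:
    for filename in list_of_files:
        with open(filename, 'rb') as hf:
            while True:
                chunk = hf.read(CHUNK_SIZE)
                if not chunk:
                    break
                newf.write(chunk)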

eyquem
  • 26,771
  • 7
  • 38
  • 46
  • @Lattyware Because I'm quite sure the execution is faster. By the way, in fact, even when the code reads a file line by line, the file is read in chunks, which are put in a cache from which each line is then read one after the other. The better procedure would be to set the length of the read chunk equal to the size of the cache. But I don't know how to determine this cache's size. – eyquem Nov 28 '12 at 20:17
  • That's the implementation in CPython, but none of that is guaranteed. Optimizing like that is a bad idea as, while it may be effective on some systems, it may not be on others. – Gareth Latty Nov 28 '12 at 20:20
  • 1
    Yes, of course line-by-line reading is buffered. That's exactly why it's not that much slower. (In fact, in some cases, it may even be slightly faster, because whoever ported Python to your platform chose a much better chunk size than 10000.) If the performance of this really matters, you'll have to profile different implementations. But 99.99…% of the time, either way is more than fast enough, or the actual disk I/O is the slow part and it doesn't matter what your code does. – abarnert Nov 28 '12 at 20:20
  • Also, if you really do need to manually optimize the buffering, you'll want to use `os.open` and `os.read`, because plain `open` uses Python's wrappers around C's stdio, which means either 1 or 2 extra buffers getting in your way. – abarnert Nov 28 '12 at 20:25
  • PS, as for why 10000 is bad: Your files are probably on a disk, with blocks that are some power of two bytes long. Let's say they're 4096 bytes. So, reading 10000 bytes means reading two blocks, then part of the next. Reading another 10000 means reading the rest of the next, then two blocks, then part of the next. Count up how many partial or complete block reads you have, and you're wasting a lot of time. Fortunately, the Python, stdio, filesystem, and kernel buffering and caching will hide most of these problems from you, but why try to create them in the first place? – abarnert Nov 28 '12 at 20:48
  • @abarnert You're perfectly right concerning the size 10000 being bad. I wrote it too quickly, though I knew that it is better to choose a size that is a power of 2, but I had forgotten why exactly. As you said, I keep re-learning things that I already knew once. – eyquem Nov 28 '12 at 21:15
  • @abarnert Another point is that I don't know if transfers of data are controlled more by Python's implementation or by the OS. And I wonder how one can know that. – eyquem Nov 28 '12 at 21:21
  • @abarnert I made a mistake when I wrote "cache" instead of "buffer". When referring to the buffering process, I meant that the reason for it is that it makes reading data more efficient. And what is true for reading lines one after the other from a buffer and re-writing them one after the other to disk is also true at a higher level for reading chunks one after the other into RAM before re-writing them from RAM to disk one after the other. - The point being, as you said, that reading and writing on disk are the slowest part of the process – eyquem Nov 28 '12 at 21:31
  • @eyquem: You can know by profiling, debugging, and/or reading the code (assuming your OS is open source, at least in the relevant parts—Python of course is). If it doesn't seem worth the effort to do any of those things, you probably don't really need to know the answer. (Usually, even if you need to optimize your code, you care more about profiling your code than what's happening under the covers. But occasionally you do need to know what's happening under the covers to figure it out—or you're just curious and motivated.) – abarnert Nov 28 '12 at 21:31
  • @abarnert So, it seems to me that code that reads large (but not gigantic; say 3 MB, why not) chunks of a gigantic (5 GB) file one after the other, and re-writes the chunks one after the other to disk, will be faster than code reading line after line. Because in that case, the file will be read in chunks of the buffer's size, which is equivalent to reading and re-writing more chunks (of the buffer's size), going from disk to buffer then to disk, while reading in big chunks means fewer trips through I/O because the data are temporarily RAM-stored – eyquem Nov 28 '12 at 21:39
  • @abarnert Wow, it's difficult for me to express such complex things in English. Excuse my poor English. And maybe I have false ideas concerning all that? I don't pretend to be a specialist. I would appreciate links to in-depth explanations concerning this complex subject. – eyquem Nov 28 '12 at 21:41
  • @abarnert I am curious and motivated to learn about the innards, yes. And it seems to me that optimization of a code isn't possible if one doesn't know a little about what is under the hood. – eyquem Nov 28 '12 at 21:46
  • @eyquem: Again, both the reading and writing are buffered. So when you call `outf.write(line)`, it doesn't go rewrite a disk block just to write those 80 characters; those 80 characters go into a buffer, and if the buffer's now over, say, 8KB, the first 8KB gets written. If 3MB were faster than 8KB, they'd use a 3MB buffer instead. So the only difference between reading and writing 3MB chunks is that you also need to do a bit of RAM work and string processing—which is much faster than disk, so it usually doesn't matter. – abarnert Nov 28 '12 at 21:50
  • @abarnert In fact, my subconscious idea is that when Python/the OS has to write a 3 MB chunk, the process doesn't go through the buffer; the data are sent directly from RAM to disk in a single transfer and write. Maybe I am wrong? – eyquem Nov 28 '12 at 22:07
  • @eyquem: Python is not calling a routine to DMA 3MB of RAM to physical disk blocks. When you use Python's file objects, they're either wrapped around C stdio, or internally buffered in a similar way. Even when it does actual reads and writes to file descriptors, those will be cached by the OS. And modern disk drives have their own caches too, not to mention that the blocks aren't even real physical blocks anymore. Unless you're writing for an Apple ][ or something, this just isn't how things work. – abarnert Nov 28 '12 at 22:11
  • @abarnert Thank you. I think you know more than me about the subject. I don't recall having read this kind of explanation. How do you know all that? I would like to study this subject, but I don't know what to consult: explanations of the OS, of C, of Python...? And where to find them? It seems to me that people are not interested in the precise innards, in general; I find it a pity. – eyquem Nov 28 '12 at 22:27
  • @eyquem: Well, I originally learned by using systems like the Apple ][ that were so simple you actually could understand all the details, and taking an OS class in college probably helped, but mainly it's just spending decades making stupid mistakes and either being corrected or figuring out the right answer… Nowadays they have free online tutorials and even course materials for just about everything, which hopefully makes things a lot easier, but I wouldn't know where to start. – abarnert Nov 28 '12 at 22:52
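
For completeness, a bare-bones sketch of the os.open/os.read approach mentioned a few comments up (purely illustrative; the buffered with open(...) versions above are usually what you want, and the file names and block size here are my own placeholders):

import os

BLOCK = 1 << 20  # 1 MiB per read, an arbitrary power-of-two size
# O_BINARY only exists on Windows; getattr makes it a no-op elsewhere
flags_out = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, 'O_BINARY', 0)
flags_in = os.O_RDONLY | getattr(os, 'O_BINARY', 0)

out_fd = os.open('combined.txt', flags_out)
try:
    for name in ['file1.txt', 'file2.txt']:  # hypothetical inputs
        in_fd = os.open(name, flags_in)
        try:
            while True:
                data = os.read(in_fd, BLOCK)
                if not data:
                    break
                os.write(out_fd, data)
        finally:
            os.close(in_fd)
finally:
    os.close(out_fd)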
0
import os

def concatFiles():
    path = 'input/'
    files = os.listdir(path)
    for idx, infile in enumerate(files):
        print("File #" + str(idx) + "  " + infile)
    concat = ''.join([open(path + f).read() for f in files])
    with open("output_concatFile.txt", "w") as fo:
        fo.write(concat)

if __name__ == "__main__":
    concatFiles()
-2
import os

files = os.listdir()
print(files)
print('#', tuple(files))
name = input('Enter the inclusive file name: ')
exten = input('Enter the type(extension): ')
filename = name + '.' + exten
output_file = open(filename, 'w+')
for i in files:
    print(i)
    f_j = open(i, 'r')
    print(f_j.read())
    f_j.seek(0)  # read() consumed the file; rewind before copying its lines
    for x in f_j:
        output_file.write(x)
    f_j.close()
output_file.close()