
I am trying to concatenate pieces of specific lines from two files. For example, I want to append something from line 2 of file2 onto line 2 of file1, then from line 6 of file2 onto line 6 of file1, and so on. Is there a way to iterate through both files simultaneously to do this? (It may help to know that the input files are about 15GB each.)

Here is a simplified example:

File 1:

Ignore
This is a
Ignore
Ignore
Ignore
This is also a
Ignore
Ignore

File 2:

Ignore
sentence
Ignore
Ignore
Ignore
sentence
Ignore
Ignore

Output file:

Ignore
This is a sentence
Ignore
Ignore
Ignore
This is also a sentence
Ignore
Ignore
Chad Nouis
The Nightman

2 Answers


Python 3:

with open('bigfile_1') as bf1:
    with open('bigfile_2') as bf2:
        for line1, line2 in zip(bf1, bf2):
            process(line1, line2)

Importantly, bf1 and bf2 will not read the entire file in at once. They are iterators which know how to produce one line at a time.

zip() works fine with iterators and produces an iterator itself, in this case one that yields pairs of lines for you to process.

Using with ensures the files will be closed afterwards.
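
For the merge described in the question, the process step could be sketched as follows. This is only a sketch under stated assumptions: the function name merge_selected_lines, the output path argument, and the first=2, step=4 pattern (lines 2, 6, 10, ...) are hypothetical and read off the sample data, not something the question states outright.

```python
def merge_selected_lines(path_1, path_2, out_path, first=2, step=4):
    """Append file 2's line to file 1's line on lines first, first+step, ...

    Every other line is copied from file 1 unchanged. Line numbers are 1-based.
    The first/step defaults are an assumption based on the sample data.
    """
    with open(path_1) as f1, open(path_2) as f2, open(out_path, "w") as out:
        # zip() pulls one line from each file per iteration, so memory use
        # stays small regardless of file size.
        for number, (line_1, line_2) in enumerate(zip(f1, f2), start=1):
            if number >= first and (number - first) % step == 0:
                # Drop the trailing newlines, join with a single space,
                # then terminate the merged line again.
                merged = line_1.rstrip("\n") + " " + line_2.rstrip("\n")
                out.write(merged + "\n")
            else:
                out.write(line_1)
```

Calling merge_selected_lines('bigfile_1', 'bigfile_2', 'merged_output') would write the merged result to a third file (the output name here is made up), leaving both inputs untouched.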

Python 2.x

import itertools

with open('bigfile_1') as bf1:
    with open('bigfile_2') as bf2:
        for line1, line2 in itertools.izip(bf1, bf2):
            process(line1, line2)

Python 2.x can't use zip the same way: there, zip builds a full list of pairs instead of a lazy iterator, eating all of your system memory with those 15GB files. We need itertools.izip, the lazy iterator version of zip.

Kenan Banks
  • what happens if they are not the same in length? – taesu Sep 10 '15 at 18:43
  • They are the same length in his example data. – saarrrr Sep 10 '15 at 18:43
  • @saarrr thus the **if**? – taesu Sep 10 '15 at 18:44
  • Note that Python 2's `zip` function returns a list with all the pairs in it. That will probably blow up the system's memory for large files. Use `itertools.izip` to get behavior like Python 3's `zip` (which is a generator). – Blckknght Sep 10 '15 at 18:44
  • @triptych *python 2* returns the list. you're showing docs for python 3. please check the version of the API docs. – acushner Sep 10 '15 at 18:48
  • @Triptych You should combine the first two lines: `with open('bigfile_1') as bf1, open('bigfile_2') as bf2:` and use normal 4 space indents. – Markus Meskanen Sep 10 '15 at 18:49
  • @MarkusMeskanen why is it relevant to combine the first two lines? – Jason Angel Aug 18 '20 at 11:12
  • to note, you can add `enumerate` by specifying the zip return, e.g. `for i, (line1, line2) in enumerate(zip(bf1, bf2)):` thus getting the line_index – Jason Angel Aug 18 '20 at 11:19
  • @JasonAngel It reduces excess indentation of the rest of the code, and these two files are "equally important" and don't rely on each other, so it just makes sense that they're opened on the same level. Functionality wise there is no difference. – Markus Meskanen Aug 20 '20 at 06:53
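
On the question in the comments about files of different lengths: plain zip stops silently at the end of the shorter file, so trailing lines of the longer one are never seen. One way to detect that is itertools.zip_longest with a sentinel fill value. The following is a minimal sketch, not part of the original answer; the helper name paired_lines and the choice to raise ValueError are assumptions made for illustration. (Python 3.10 and later also accept zip(..., strict=True), which raises on a length mismatch.)

```python
from itertools import zip_longest

# Sentinel marking "this file ran out of lines"; a unique object cannot be
# confused with any real line, including an empty string.
_MISSING = object()


def paired_lines(file_1, file_2):
    """Yield (line_number, line_1, line_2); raise if one input ends early."""
    pairs = zip_longest(file_1, file_2, fillvalue=_MISSING)
    for number, (line_1, line_2) in enumerate(pairs, start=1):
        if line_1 is _MISSING or line_2 is _MISSING:
            raise ValueError(f"inputs differ in length at line {number}")
        yield number, line_1, line_2
```

Like zip, zip_longest is lazy, so this still reads only one line from each file at a time.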

You can use the zip built-in to loop over multiple iterables at once.

Example

x = y = [1, 2, 3]
for a, b in zip(x, y):
    print(a, b)

The output will look like this:

1 1
2 2
3 3

The same principle will work for your files.

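with open("/path/to/file-1") as file_1:
    with open("/path/to/file-2") as file_2:
        for line_1, line_2 in zip(file_1, file_2):
            # Strip each line's trailing newline so the pair prints on one line.
            print(line_1.rstrip("\n"), line_2.rstrip("\n"))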

Your output will be the concatenation of the matching lines from either file separated by a single space.

Zach Gates
  • "The input files are about 15GB each" so `readlines()` on both into memory at once will fail almost certainly. – Two-Bit Alchemist Sep 10 '15 at 18:30
  • Not only lists, but any iterable. It is better to use the with idiom, – PuercoPop Sep 10 '15 at 18:30
  • This is a memory hog for files that big. – saarrrr Sep 10 '15 at 18:30
  • No, your edit is wrong. Not only does your `with` code not show how to iterate two files simultaneously, your "warning" is outright false. It will not be "VERY slow" -- it will run out of memory trying to allocate 30GB of RAM and fail utterly. – Two-Bit Alchemist Sep 10 '15 at 18:33
  • @Two-BitAlchemist So edit the mistake or provide a better answer. No need for the unnecessary rudeness. – PuercoPop Sep 10 '15 at 18:37
  • @PuercoPop It is against site etiquette for me to edit his answer and code into something that it is not. What is rude about saying exactly what is wrong with a provided answer when it does not answer the question? – Two-Bit Alchemist Sep 10 '15 at 18:38
  • No, it won't necessarily "run out". That is dependent upon the machine. OP didn't list any hardware specs, so who's to say? @Two-BitAlchemist As far as my edit goes, I submitted what I have currently to provide a rough lining. I can only edit (and type for that matter) so fast. – Zach Gates Sep 10 '15 at 18:39
  • As far as my edit "not show[ing] how to iterate two files simultaneously", it's the same principle of using the `zip` built-in. I also linked the documentation.. @Two-BitAlchemist – Zach Gates Sep 10 '15 at 18:43
  • In Python 2 you can use `itertools.izip` to avoid reading the whole file at once. In Python 3, that's the default behavior of `zip`. – Blckknght Sep 10 '15 at 18:43
  • @Two-BitAlchemist The tone of your first sentence is certainly rude. You are outright reiterating he is wrong. Also the answer is use zip. That they greedily read the whole file into memory is an implementation detail. – PuercoPop Sep 10 '15 at 18:53
  • This edit of the solution works for python 3 and OP did not state python 2 until after it was written so I have upvoted it. – Josh J Sep 10 '15 at 18:53
  • @PuercoPop Those two comments address different edits in the revision history. I downvoted, explained. An edit was made. I explained why it did not address the problem. If you think it's rude, rather than arguing with me in comments, flag it as rude/offensive and have a mod take a look at it. The point about when OP stated they were using Python 2 is invalid: that's the reason you wait to answer questions after clarifying comments are made. If you answer while it's still ambiguous, it could turn out you are wrong. – Two-Bit Alchemist Sep 10 '15 at 18:56
  • @ZachGates I think it's a little silly to post an answer that just assumes someone might have 30GB of RAM. Most commonly available computers would not have that, and I find that defense disingenuous. As far as providing incomplete edits and only typing so fast and getting a downvote, that's just the way it is. I frequently revisit my actions and respond to comments, but not everyone does. If you submit something incomplete, and don't like the way it's reviewed, next time submit it when it's done. – Two-Bit Alchemist Sep 10 '15 at 19:02