
I have spent considerable time trying to get the Linux diff and patch tools to work with strings in Python. To achieve this I am using named pipes, since they seem the most robust way to go. The problem is that this doesn't work for big inputs.

Example:

a, b = str1, str2 # ~1MB each string

fname1, fname2 = mkfifos(2)
proc = subprocess.Popen(['diff', fname1, fname2], \
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)

print('Writing first file.')
with open(fname1, 'w') as f1:
    f1.write(a)
print('Writing second file.')
with open(fname2, 'w') as f2:
    f2.write(b)

This hangs at the first write. I figured out that if I use a[:6500] it hangs on the second write instead, so I assume it has something to do with the buffer. I tried manually flushing after each write, closing, and using the low-level os.open(f, 'r', 0) with 0 buffer, but the issue remains.

I thought of looping through the write in chunks but that feels wrong in a high level language like Python. Any ideas what I am doing wrong?

jfs
Pithikos
  • Wouldn't the FIFO buffer fill up if you just write to one first, or would diff empty only one of them gradually? – J. P. Petersen Oct 17 '16 at 10:45
  • @J.P.Petersen yes I assume that is what is happening; diff is reading both files gradually so it ends up in a deadlock. It works fine if the first write is done in a thread. – Pithikos Oct 17 '16 at 11:32
  • If the input strings `str1`, `str2` come from other processes, take a look at the question from [the subprocess tag description](http://stackoverflow.com/tags/subprocess/info) that shows ["how to emulate the bash process substitution such as `a <(b) <(c)`"](http://stackoverflow.com/q/28840575/4279) – jfs Oct 29 '16 at 11:29

1 Answer


A named pipe is still a pipe. It has a finite buffer on Linux. You can't write an unlimited output unless someone reads from the other end of the pipe at the same time.
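To make the finite buffer concrete, here is a minimal sketch that fills a pipe nobody reads from and counts how many bytes it accepts. It uses an anonymous pipe from os.pipe() rather than a FIFO, on the assumption that both share the same kernel buffer behaviour; pipe_capacity is a hypothetical helper name, not something from the question.

```python
import os

def pipe_capacity():
    """Fill a pipe that has no active reader and count the bytes it accepts."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode: a full buffer raises BlockingIOError instead of hanging.
    os.set_blocking(write_fd, False)
    total = 0
    try:
        while True:
            total += os.write(write_fd, b'x' * 4096)
    except BlockingIOError:
        pass  # buffer is full; a blocking write would hang at this point
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return total

if __name__ == '__main__':
    print('pipe accepted', pipe_capacity(), 'bytes before it would block')
```

The exact number depends on the system (pipe(7) documents a default capacity of 64 KiB on current Linux kernels), but it is always finite, so a blocking write of a ~1MB string cannot complete until the reader drains the pipe.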

If f1.write(a) blocks then it means diff doesn't read all the input files at once (it seems logical: the purpose of the diff program is to compare files line by line--reading from the first file won't be too far ahead of the reading from the second file).

To write different data to different places concurrently, you could use threads or asyncio:

#!/usr/bin/env python3
from subprocess import Popen, PIPE
from threading import Thread

def write_input_async(path, text):
    def writelines():
        with open(path, 'w') as file:
            for line in text.splitlines(keepends=True):
                file.write(line)
    Thread(target=writelines, daemon=True).start()

with named_pipes(2) as paths, \
     Popen(['diff'] + paths, stdout=PIPE, stderr=PIPE,
           universal_newlines=True) as p:
    for path, text in zip(paths, [a, b]):
        write_input_async(path, text)
    output, errors = p.communicate()

where the named_pipes(n) context manager is defined here.
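The linked definition is not reproduced in this answer. As a rough sketch of what such a context manager could look like (an assumption, not necessarily the linked code), it can create the FIFOs in a private temporary directory and remove that directory on exit. The 'fifo%d' file names are arbitrary.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def named_pipes(n=1):
    """Yield a list of n FIFO paths; delete them when the block exits."""
    dirname = tempfile.mkdtemp()  # private directory, so names cannot clash
    try:
        paths = [os.path.join(dirname, 'fifo%d' % i) for i in range(n)]
        for path in paths:
            os.mkfifo(path)  # Unix only
        yield paths  # a list, so ['diff'] + paths works
    finally:
        shutil.rmtree(dirname)  # removes the FIFOs and the directory
```

Yielding a list rather than a generator matters for the code above, which concatenates the paths onto the diff command line.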

Note: unless you call .communicate(), the diff process may hang as soon as either of its stdout/stderr OS pipe buffers fills up.


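You could also consider whether difflib.context_diff() would work in your case. Note that it expects sequences of lines rather than whole strings, so a and b would need to be split first, e.g. with a.splitlines(keepends=True).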
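A minimal sketch of the difflib route, which avoids the external process and the pipes entirely. diff_strings and the 'a'/'b' file labels are hypothetical names chosen for illustration; the output is a context diff, which is not byte-for-byte what the diff command prints by default.

```python
import difflib

def diff_strings(a, b, fromfile='a', tofile='b'):
    """Return a context diff of two strings as a single string."""
    # context_diff compares sequences of lines, so split the strings first.
    a_lines = a.splitlines(keepends=True)
    b_lines = b.splitlines(keepends=True)
    return ''.join(difflib.context_diff(a_lines, b_lines, fromfile, tofile))

if __name__ == '__main__':
    print(diff_strings('one\ntwo\nthree\n', 'one\n2\nthree\n'), end='')
```

If the result has to be fed to the patch tool, difflib.unified_diff takes the same arguments and produces unified format instead.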

Community
jfs