109

I have a Python function that writes an output file to disk.

I want to write a unit test for it using Python's unittest module.

How should I assert equality of files? I would like to get an error if the file content differs from the expected one + list of differences. As in the output of the Unix diff command.

Is there an official or recommended way of doing that?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
jan
  • 1,593
  • 2
  • 12
  • 15

7 Answers7

90

I prefer to have output functions explicitly accept a file handle (or file-like object), rather than accept a file name and opening the file themselves. This way, I can pass a StringIO object to the output function in my unit test, then .read() the contents back from that StringIO object (after a .seek(0) call) and compare with my expected output.

For example, we would transition code like this

##File:lamb.py
import sys


def write_lamb(outfile_path):
    with open(outfile_path, 'w') as outfile:
        outfile.write("Mary had a little lamb.\n")


if __name__ == '__main__':
    write_lamb(sys.argv[1])



##File test_lamb.py
import unittest
import tempfile

import lamb


class LambTests(unittest.TestCase):
    def test_lamb_output(self):
        outfile_path = tempfile.mkstemp()[1]
        try:
            lamb.write_lamb(outfile_path)
            contents = open(tempfile_path).read()
        finally:
            # NOTE: To retain the tempfile if the test fails, remove
            # the try-finally clauses
            os.remove(outfile_path)
        self.assertEqual(contents, "Mary had a little lamb.\n")

to code like this

##File:lamb.py
import sys


def write_lamb(outfile):
    outfile.write("Mary had a little lamb.\n")


if __name__ == '__main__':
    with open(sys.argv[1], 'w') as outfile:
        write_lamb(outfile)



##File test_lamb.py
import unittest
from io import StringIO

import lamb


class LambTests(unittest.TestCase):
    def test_lamb_output(self):
        outfile = StringIO()
        # NOTE: Alternatively, for Python 2.6+, you can use
        # tempfile.SpooledTemporaryFile, e.g.,
        #outfile = tempfile.SpooledTemporaryFile(10 ** 9)
        lamb.write_lamb(outfile)
        outfile.seek(0)
        content = outfile.read()
        self.assertEqual(content, "Mary had a little lamb.\n")

This approach has the added benefit of making your output function more flexible if, for instance, you decide you don't want to write to a file, but some other buffer, since it will accept all file-like objects.

Note that using StringIO assumes the contents of the test output can fit into main memory. For very large output, you can use a temporary file approach (e.g., tempfile.SpooledTemporaryFile).

William
  • 3
  • 2
gotgenes
  • 38,661
  • 28
  • 100
  • 128
  • 2
    This is better then writing a file to disk. If you are running tons of unittests, IO to disk causes all kinds of problems, especially trying to clean them up. I had tests writing to disk, the tearDown deleting the written files. Tests would work fine one at a time, then fail when Run All. At least with Visual Studio and PyTools on a Win machine. Also, speed. – srock Jul 30 '15 at 17:47
  • 3
    While this is a nice solution to test separate functions, it is still troublesome when testing the actual interface that your program provides (e.g. a CLI tool).. – Joost Oct 27 '15 at 13:39
  • 1
    I got error: TypeError: unicode argument expected, got 'str' – cn123h Jan 21 '19 at 16:10
  • I came here because I'm trying to write unit tests for walking and reading partitioned parquet datasets file-by-file. This requires parsing the file path to obtain the key/value pairs to assign the appropriate value of a partition to (ultimately) the resultant pandas DataFrame. Writing to a buffer, while nice, doesn't give me the ability to parse for partition values. – PMende May 14 '19 at 22:46
  • 1
    @PMende It sounds like you're working with an API that needs interaction with an actual filesystem. Unit tests are not always the appropriate level of testing. It is okay to not test all parts of your code at the level of unit tests; integration or system tests should be used where appropriate, too. Try to contain those parts, though, and pass only simple values between boundaries whenever possible. See https://www.youtube.com/watch?v=eOYal8elnZk – gotgenes May 14 '19 at 23:25
  • I wound up using Python's `tempfile` module to randomly generate temporary partitioned DataSets. But you brought up some good things to think about. Thanks for the input! – PMende May 14 '19 at 23:43
  • 1up for always do whatever you can to avoid reading/writing to disk in tests. – Benp44 Jun 10 '19 at 13:59
  • @cn123h I got the same issue and found that using CStringIO did not elicit the error but writing to a real file did. I therefore had to use a real file in the testing. – NeilG Jul 17 '19 at 05:40
58

The simplest thing is to write the output file, then read its contents, read the contents of the gold (expected) file, and compare them with simple string equality. If they are the same, delete the output file. If they are different, raise an assertion.

This way, when the tests are done, every failed test will be represented with an output file, and you can use a third-party tool to diff them against the gold files (Beyond Compare is wonderful for this).

If you really want to provide your own diff output, remember that the Python stdlib has the difflib module. The new unittest support in Python 3.1 includes an assertMultiLineEqual method that uses it to show diffs, similar to this:

    def assertMultiLineEqual(self, first, second, msg=None):
        """Assert that two multi-line strings are equal.

        If they aren't, show a nice diff.

        """
        self.assertTrue(isinstance(first, str),
                'First argument is not a string')
        self.assertTrue(isinstance(second, str),
                'Second argument is not a string')

        if first != second:
            message = ''.join(difflib.ndiff(first.splitlines(True),
                                                second.splitlines(True)))
            if msg:
                message += " : " + msg
            self.fail("Multi-line strings are unequal:\n" + message)
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Ned Batchelder
  • 364,293
  • 75
  • 561
  • 662
  • 4
    no, the best way overall is not to write to a file which can be slow and is error prone (prod env may be completely different than test/CI env, like Windows vs. OSX), but instead to mock the call to `open` as described in other answers on this page, using `unittest.mock` (see answer from Enrico M) – Eric Dec 06 '20 at 17:23
37

I always try to avoid writing files to disk, even if it's a temporary folder dedicated to my tests: not actually touching the disk makes your tests much faster, especially if you interact with files a lot in your code.

Suppose you have this "amazing" piece of software in a file called main.py:

"""
main.py
"""

def write_to_file(text):
    with open("output.txt", "w") as h:
        h.write(text)

if __name__ == "__main__":
    write_to_file("Every great dream begins with a dreamer.")

To test the write_to_file method, you can write something like this in a file in the same folder called test_main.py:

"""
test_main.py
"""
from unittest.mock import patch, mock_open

import main


def test_do_stuff_with_file():
    open_mock = mock_open()
    with patch("main.open", open_mock, create=True):
        main.write_to_file("test-data")

    open_mock.assert_called_with("output.txt", "w")
    open_mock.return_value.write.assert_called_once_with("test-data")
Enrico Marchesin
  • 4,650
  • 2
  • 20
  • 15
  • 2
    This answer is what I was looking for; not how to test equality with something in a file, but just how to test that the `open` was called successfully without needing to mess about with real file I/O. This question was asking something different, but searching for mine brought me here first. – cjameson Sep 09 '22 at 16:37
23
import filecmp

Then

self.assertTrue(filecmp.cmp(path1, path2))
tbc0
  • 1,563
  • 1
  • 17
  • 21
  • 3
    By [default](https://docs.python.org/3/library/filecmp.html#filecmp.cmp) this does a `shallow` comparison which checks only the files metadata (mtime, size, etc). Please add `shallow=False` in your example. – famzah May 16 '20 at 15:59
  • 3
    Additionally, results are [cached](https://docs.python.org/3/library/filecmp.html#filecmp.clear_cache). – famzah May 16 '20 at 16:01
3

You could separate the content generation from the file handling. That way, you can test that the content is correct without having to mess around with temporary files and cleaning them up afterward.

If you write a generator method that yields each line of content, then you can have a file handling method that opens a file and calls file.writelines() with the sequence of lines. The two methods could even be on the same class: test code would call the generator, and production code would call the file handler.

Here's an example that shows all three ways to test. Usually, you would just pick one, depending on what methods are available on the class to test.

import os
from io import StringIO
from unittest.case import TestCase


class Foo(object):
    def save_content(self, filename):
        with open(filename, 'w') as f:
            self.write_content(f)

    def write_content(self, f):
        f.writelines(self.generate_content())

    def generate_content(self):
        for i in range(3):
            yield u"line {}\n".format(i)


class FooTest(TestCase):
    def test_generate(self):
        expected_lines = ['line 0\n', 'line 1\n', 'line 2\n']
        foo = Foo()

        lines = list(foo.generate_content())

        self.assertEqual(expected_lines, lines)

    def test_write(self):
        expected_text = u"""\
line 0
line 1
line 2
"""
        f = StringIO()
        foo = Foo()

        foo.write_content(f)

        self.assertEqual(expected_text, f.getvalue())

    def test_save(self):
        expected_text = u"""\
line 0
line 1
line 2
"""
        foo = Foo()

        filename = 'foo_test.txt'
        try:
            foo.save_content(filename)

            with open(filename, 'rU') as f:
                text = f.read()
        finally:
            os.remove(filename)

        self.assertEqual(expected_text, text)
Don Kirkby
  • 53,582
  • 27
  • 205
  • 286
3

If you can use it, I'd strongly recommend using the following library: http://pyfakefs.org

pyfakefs creates an in-memory fake filesystem, and patches all filesystem accesses with accesses to the fake filesystem. This means you can write your code as you normally would, and in your unit tests, just make sure to initialize the fake filesystem in your setup.

from pyfakefs.fake_filesystem_unittest import TestCase

class ExampleTestCase(TestCase):
    def setUp(self):
        self.setUpPyfakefs()

    def test_create_file(self):
        file_path = '/test/file.txt'
        self.assertFalse(os.path.exists(file_path))
        self.fs.create_file(file_path)
        self.assertTrue(os.path.exists(file_path))

There is also a pytest plugin if you prefer to run your tests with pytest.

There are some caveats to this approach - you can't use it if you're accessing the filesystem via C libraries, because pyfakefs can't use patch those calls. However, if you're working in pure Python, I've found it to be a tremendously useful library.

-1

Based on suggestions I did the following.

class MyTestCase(unittest.TestCase):
    def assertFilesEqual(self, first, second, msg=None):
        first_f = open(first)
        first_str = first_f.read()
        second_f = open(second)
        second_str = second_f.read()
        first_f.close()
        second_f.close()

        if first_str != second_str:
            first_lines = first_str.splitlines(True)
            second_lines = second_str.splitlines(True)
            delta = difflib.unified_diff(first_lines, second_lines, fromfile=first, tofile=second)
            message = ''.join(delta)

            if msg:
                message += " : " + msg

            self.fail("Multi-line strings are unequal:\n" + message)

I created a subclass MyTestCase as I have lots of functions that need to read/write files so I really need to have re-usable assert method. Now in my tests, I would subclass MyTestCase instead of unittest.TestCase.

What do you think about it?

jan
  • 1,593
  • 2
  • 12
  • 15