What's the Pythonic way to store a data block in a Python script?

Question

Perl allows me to use the __DATA__ token in a script to mark the start of a data block. I can read the data using the DATA filehandle. What's the Pythonic way to store a data block in a script?

Put it in a separate file (module) and import it. Don't do it inline. — agf, Aug 04 '11 at 14:48
@agf - I disagree. Inlining a file-like object using a triple-quoted string wrapped in a StringIO makes for a portable and self-contained test case or demo script. — PaulMcG, Aug 04 '11 at 19:18
The string methods all require the strings to be defined in the file before they are used. The perl __DATA__ section comes after the code. Right? Please let me know if there is a work-around to that. — spazm, May 16 '13 at 02:05
There is a [related thread in Python mailing list](https://mail.python.org/pipermail/python-list/2012-June/625762.html). — Palec, Oct 07 '14 at 00:05

score 11 · Answer 1 · answered Aug 04 '11 at 14:07


It depends on your data, but dict literals and multi-line strings are both really good ways.

state_abbr = {
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MS': 'Mississippi',
    'MN': 'Minnesota',
    'MO': 'Missouri',
    }

gettysburg = """
Four score and seven years ago,
our fathers brought forth on this continent
a new nation, 
conceived in liberty
and dedicated to the proposition
that all men are created equal.
"""

answered Aug 04 '11 at 14:07

Ned Batchelder
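
One detail worth noting about the multi-line string above: it begins with a newline right after the opening quotes. A minimal sketch of consuming both kinds of literal, assuming Python 3 and shortened copies of the data from the answer:

```python
state_abbr = {
    'MA': 'Massachusetts',
    'MI': 'Michigan',
}

gettysburg = """
Four score and seven years ago,
our fathers brought forth on this continent
a new nation,
"""

# The literal begins with a newline right after the opening quotes,
# so strip() before splitting to avoid an empty first line.
lines = gettysburg.strip().splitlines()

for abbr, name in sorted(state_abbr.items()):
    print(abbr, name)
print(len(lines), lines[0])
```

Writing the opening quotes as `"""\` is another way to suppress that first newline.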


If it's binary data (i.e. bytes and not text) you can include that as well by prefixing the string with b, e.g. b"\x00\x01\x16\x38". That's used by Qt to include resource files, for example. – Voo Aug 04 '11 at 14:15
@Voo: The b prefix doesn't do that. It's ignored in Python 2, and in Python 3 means to create a bytes literal instead of a string (unicode) literal. Binary data can be included as hex escapes into a regular un-prefixed string just fine. – Ned Batchelder Aug 04 '11 at 14:17
Oh right, was in Python3 mode. Sure since "strings" in python 2 aren't unicode to begin with the prefix wouldn't make much sense. But are you really allowed to include illegal unicode codepoints in a python 3 string? That's.. surprising, especially since the conversion from bytes (eg read from a socket) to unicode does indeed check if it makes sense. – Voo Aug 04 '11 at 15:23
Indeed. `str = "\x80abc"` works although it contains an illegal utf-8 codepoint while `str = b"\x80abc".decode("utf-8")` fails predictably. What a strange behavior. Seems like the result is just ignored (ie as if you set the errors mode of decode to "ignore") – Voo Aug 04 '11 at 15:32
U+0080 is defined as a C1 control character. Its UTF-8 encoding is `b'\xc2\x80'`. The problem with `b"\x80abc"` is that it's an invalid UTF-8 sequence, a different issue altogether. – chepner Dec 09 '20 at 18:36
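
The distinction discussed in these comments can be checked directly. A small sketch, assuming Python 3: in a str literal `\x80` names the code point U+0080, while in a bytes literal it names the raw byte 0x80.

```python
# Python 3 sketch: str escapes name code points, bytes escapes name raw bytes.
text = "\x80abc"                  # str: first character is code point U+0080
encoded = text.encode("utf-8")    # U+0080 becomes the two bytes C2 80

raw = b"\x80abc"                  # bytes: first byte is 0x80
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # A lone 0x80 is a continuation byte, so this is not valid UTF-8.
    decode_failed = True

print(repr(encoded), decode_failed)
```

So the str literal is a valid string, and only the bytes-to-text conversion is rejected.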

score 6 · Answer 2 · answered Aug 04 '11 at 14:33

Use the StringIO module to create an in-source file-like object:

from StringIO import StringIO

textdata = """\
Now is the winter of our discontent,
Made glorious summer by this sun of York.
"""

# in place of __DATA__ = open('richard3.txt')
__DATA__ = StringIO(textdata)
for d in __DATA__:
    print d

__DATA__.seek(0)
print __DATA__.readline()

Prints:

Now is the winter of our discontent,

Made glorious summer by this sun of York.

Now is the winter of our discontent,

(I just called this __DATA__ to align with your original question. In practice, this would not be good Python naming style - something like datafile would be more appropriate.)
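
For Python 3, where StringIO lives in the io module and print is a function, the same idea might look like the sketch below. It also shows one way to keep the data after the code, as Perl's __DATA__ section does: a module-level name is only looked up when the function runs. The names read_records, main and TEXTDATA are made up for this illustration.

```python
from io import StringIO  # Python 3; Python 2 used the StringIO module


def read_records(datafile):
    """Return the non-empty lines of a file-like object, newlines removed."""
    return [line.rstrip("\n") for line in datafile if line.strip()]


def main():
    # TEXTDATA is looked up when main() runs, so it may be defined below.
    datafile = StringIO(TEXTDATA)
    return read_records(datafile)


# The data block sits after the code, as with Perl's __DATA__ section.
TEXTDATA = """\
Now is the winter of our discontent,
Made glorious summer by this sun of York.
"""

lines = main()
print(lines)
```

The only requirement is that main() is called after TEXTDATA has been assigned, which is why the call is the last statement.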

NEVER use double underscore names for anything but the standard magic methods. — agf, Aug 04 '11 at 14:48

score 1 · Answer 3 · answered Aug 05 '11 at 15:15

IMO it depends heavily on the type of data: if you have only text and can be sure that neither ''' nor """ can appear in it, you can store the text in a triple-quoted string. But what if you want to store text where ''' or """ is known to occur, or might occur? Then it is advisable to

either store the data in some encoded form, or
put it in a separate file.

Example: The text is

There are many '''s and """s in Python libraries.

In this case, a plain triple-quoted string is awkward. You can escape the quotes:

__DATA__ = """There are many '''s and \"""s in Python libraries."""
print __DATA__

But then you have to be careful whenever you edit or replace the text. In this case, it might be more practical to base64-encode the text:

$ python -c 'import sys; print sys.stdin.read().encode("base64")'
There are many '''s and """s in Python libraries.<press Ctrl-D twice>

then you get

VGhlcmUgYXJlIG1hbnkgJycncyBhbmQgIiIicyBpbiBQeXRob24gbGlicmFyaWVzLg==

as output. Take this and put it into your script, for example:

__DATA__ = 'VGhlcmUgYXJlIG1hbnkgJycncyBhbmQgIiIicyBpbiBQeXRob24gbGlicmFyaWVzLg=='.decode('base64')
print __DATA__

and see the result.
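
The "base64" string codec used above exists only in Python 2. A rough Python 3 equivalent, assuming the standard base64 module and UTF-8 text (the variable names are illustrative):

```python
import base64

text = "There are many '''s and \"\"\"s in Python libraries."

# Encoding step: done once, outside the script, to produce the literal.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")

# Decoding step: this is what the script itself would contain.
DATA = base64.b64decode(
    "VGhlcmUgYXJlIG1hbnkgJycncyBhbmQgIiIicyBpbiBQeXRob24gbGlicmFyaWVzLg=="
).decode("utf-8")

print(DATA)
```

The encoded literal contains only letters, digits, +, / and =, so no quoting problems can arise however the original text changes.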

stderr · Answer 4 · 2011-08-04T19:07:43.800

I'm not familiar with Perl's __DATA__, but Google tells me it's often used for testing. If you are also looking into testing your code, you may want to consider doctest (http://docs.python.org/library/doctest.html). First, though, the direct equivalent of DATA as a file-like object:

import StringIO

__DATA__ = StringIO.StringIO("""lines
of data
from a file
""")

Assuming you wanted DATA to be a file object, that is now what you have, and you can use it like most other file objects from here on. For example:

if __name__=="__main__":
    # test myfunc with test data:
    lines = __DATA__.readlines()
    myfunc(lines)

But if the only use of DATA is for testing, you are probably better off creating a doctest or writing a test case in PyUnit / Nose.

For example:

import StringIO

def myfunc(lines):
    r"""Do something to each line

    Here's an example:

    >>> data = StringIO.StringIO("line 1\nline 2\n")
    >>> myfunc(data)
    ['1', '2']
    """
    return [line[-2] for line in lines]

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Running those tests like this:

$ python ~/doctest_example.py -v
Trying:
    data = StringIO.StringIO("line 1\nline 2\n")
Expecting nothing
ok
Trying:
    myfunc(data)
Expecting:
    ['1', '2']
ok
1 items had no tests:
    __main__
1 items passed all tests:
   2 tests in __main__.myfunc
2 tests in 2 items.
2 passed and 0 failed.
Test passed.

Doctest does a lot of different things, including finding Python tests in plain text files and running them. Personally, I'm not a big fan and prefer more structured testing approaches (import unittest), but it is unequivocally a Pythonic way to test one's code.
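
For comparison, the same check written with unittest might look like the sketch below. It assumes Python 3 (io.StringIO) and reuses myfunc from the doctest example; the class and method names are invented for illustration.

```python
import unittest
from io import StringIO  # Python 3 location of StringIO


def myfunc(lines):
    """Return the character just before the trailing newline of each line."""
    return [line[-2] for line in lines]


class MyFuncTest(unittest.TestCase):
    def test_inline_data(self):
        # The test data lives next to the test instead of in a DATA block.
        data = StringIO("line 1\nline 2\n")
        self.assertEqual(myfunc(data), ['1', '2'])


# Run the case explicitly; a script would normally call unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MyFuncTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real test module the last two lines would usually be replaced by the `if __name__ == "__main__": unittest.main()` idiom.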
