Why is a Python 3.x buffer larger than bash dd?

Question

I want to copy a big file (>=1GB) to memory:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from subprocess import check_output
from shlex import split

zeroes = open('/dev/zero')

SCALE = 1024

B = 1
KB = B * SCALE
MB = KB * SCALE
GB = MB * SCALE

def ck(label):
    print('{}:\n{}\n'.format(label, check_output(split('free -m')).decode()))

ck('## Before')

buffer = zeroes.read(GB)

ck('## After')

Output:

## Before:
              total        used        free      shared  buff/cache   available
Mem:          15953        7080        6684         142        2188        8403
Swap:          2047           0        2047


## After:
              total        used        free      shared  buff/cache   available
Mem:          15953        9132        4632         142        2188        6351
Swap:          2047           0        2047

Free memory dropped by 6684 - 4632 = 2052 MB, which is almost 2x the expected 1 GB.

Tests with dd show the expected results:

# mkdir -p /mnt/tmpfs/
# mount -t tmpfs -o size=1000G tmpfs /mnt/tmpfs/
# free -m 
              total        used        free      shared  buff/cache   available
Mem:          15953        7231        6528         144        2192        8249
Swap:          2047           0        2047
# dd if=/dev/zero of=/mnt/tmpfs/big_file bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.695143 s, 1.5 GB/s
# free -m 
              total        used        free      shared  buff/cache   available
Mem:          15953        7327        5406        1168        3219        7129
Swap:          2047           0        2047

What's the problem? Why did Python use 2x as much memory?

What are the best practices to replicate the desired output * in Python 3.x?

* Desired output - Python uses the same amount of memory as dd.

Re: "desired output" -- what *is* your desired output? If you want to work with huge storage buffers -- in any language -- you're better off using memory-mapped IO. — Charles Duffy, Jan 07 '17 at 20:46
BTW, if your content is a Unicode string instead of a bytestring... well, there's your problem. — Charles Duffy, Jan 07 '17 at 20:49
...as an aside, please *stop* propagating the silly `check_output(split('string with spaces'))` idiom -- it encourages bugs that don't happen in `check_output(['string', 'with', 'spaces'])`, as parameters that are going to be substituted in need to be shell-quoted first. (That is to say, `check_output(split('rm -- %s' % filename))` can delete multiple files if passed a name with spaces or glob characters, whereas `check_output(['rm', '--', filename])` is guaranteed to delete only one). — Charles Duffy, Jan 07 '17 at 20:53
@CharlesDuffy I have never seen a case when `/dev/zero` was an unicode string. — NarūnasK, Jan 07 '17 at 21:12
You're telling Python to interpret it as one. It would do that with *any* file, if you read it with the same code. — Charles Duffy, Jan 07 '17 at 21:13
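The quoting concern raised in the comments above can be seen without running any command. This is a minimal sketch: the file name `my file.txt` is a made-up example, and `shlex.quote` is shown only as one way to keep a substituted parameter intact when a string is split.

```python
from shlex import quote, split

filename = 'my file.txt'  # hypothetical name containing a space

# Substituting into a string and then splitting breaks the name apart.
unsafe = split('rm -- %s' % filename)

# Quoting the parameter first keeps it as a single argument.
quoted = split('rm -- %s' % quote(filename))

# Building the argument list directly needs no quoting at all.
direct = ['rm', '--', filename]

print(unsafe)   # ['rm', '--', 'my', 'file.txt']
print(quoted)   # ['rm', '--', 'my file.txt']
print(direct)   # ['rm', '--', 'my file.txt']
```

Nothing is executed here; the lists are only printed, so the sketch is safe to run.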

score 0 · Accepted Answer · edited May 23 '17 at 10:30

See "How is unicode represented internally in Python?".

Because you aren't specifying that your file is binary, you're reading unicode characters, which require 2-4 bytes per character to store in-memory, even for a codepoint represented as a single byte on-disk.

Use:

zeroes = open('/dev/zero', 'rb') # the 'b' flag is critical here!

...to open your file to read bytestrings.
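As a minimal sketch of the difference between the two modes, the snippet below reads the same data once in text mode and once in binary mode. It assumes a small temporary file of NUL bytes as a portable stand-in for `/dev/zero`, and 1 MiB instead of 1 GiB so it runs quickly; the names `SIZE`, `text` and `data` are illustrative only.

```python
import os
import sys
import tempfile

SIZE = 1 << 20  # 1 MiB, a scaled-down stand-in for the 1 GiB read

# Assumption: a temp file of NUL bytes replaces /dev/zero for portability.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\0' * SIZE)
    path = tmp.name

try:
    with open(path) as f:        # text mode: bytes are decoded, result is str
        text = f.read(SIZE)      # SIZE counts characters here
    with open(path, 'rb') as f:  # binary mode: no decoding, result is bytes
        data = f.read(SIZE)      # SIZE counts bytes here
finally:
    os.remove(path)

# Report the type, length and object size of each result.
print(type(text).__name__, len(text), sys.getsizeof(text))
print(type(data).__name__, len(data), sys.getsizeof(data))
```

`sys.getsizeof` reports only the size of the final object, not any temporary buffers used while decoding, so comparing peak process memory would need a separate measurement.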

answered Jan 07 '17 at 20:56 by Charles Duffy
edited May 23 '17 at 10:30 by Community

Since the implementation of PEP 393 in Python 3.3, unicode strings are internally represented in a number of different ways. Depending on requirements, 1, 2 or 4 byte sequences can be used to store a character ([unicode api](https://docs.python.org/3/c-api/unicode.html)). These are fixed width encodings and I believe correspond to latin-1, UCS-2, and UTF-32. There is no need to use a 2 or 4-byte representation when decoding `/dev/zero` as characters. The data can be adequately stored using latin-1. Try: `import sys; s = '\0' * (1 << 20); print(sys.getsizeof(s))`. The result is 1MB + 25 bytes. – Dunes, Jan 08 '17 at 12:09
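The per-character widths described in this comment can be checked roughly as follows. This is a sketch that assumes CPython 3.3 or later, where `sys.getsizeof` reflects the PEP 393 storage; the helper `bytes_per_char` is a hypothetical name, and other interpreters may behave differently.

```python
import sys

N = 1 << 16  # number of characters used for the measurement


def bytes_per_char(ch):
    """Estimate the storage width of a string made only of `ch`.

    Comparing strings of length 2*N and N cancels the fixed
    per-object overhead, leaving N * width bytes.
    """
    return (sys.getsizeof(ch * (2 * N)) - sys.getsizeof(ch * N)) // N


# The widest code point in a string decides the width for the whole string.
for label, ch in [('NUL', '\0'),
                  ('U+00E9', '\xe9'),
                  ('U+20AC', '\u20ac'),
                  ('U+1F600', '\U0001F600')]:
    print(label, bytes_per_char(ch))
```

On CPython this reports a width of 1 for NUL characters, which is consistent with the comment's point that text read from `/dev/zero` does not need a 2 or 4 byte representation.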
