3

Whenever a Python object needs to be stored or sent over a network, it is first serialized. I guess the reason is that storage and network transfer are both based on bits. I have a stupid question, which is more of a computer science fundamentals question than a Python question. What kind of format do Python objects take when they are in cache? Shouldn't they already be represented as bits? If that's the case, why not just use those bits to store or send the object, and why bother with serialization?

dirn
David Zheng

2 Answers

3

Bit Representation

The same object can have different bit representations on different machines:

  • Think endianness (byte order)
  • and architecture (32 bits, 64 bits)

So an object's bit representation on the sending machine could mean nothing (or worse, could mean something else) when read on the receiving machine.

Take a simple integer, 1025, stored in 32 bits, as an illustration of the problem:

  • On a big-endian machine, the bit representation in memory is:
    • binary: 00000000 00000000 00000100 00000001
    • hexadecimal: 0x00000401
  • while on a little-endian machine the same value is stored with its bytes reversed:
    • binary: 00000001 00000100 00000000 00000000
    • hexadecimal: 0x01040000
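The two byte layouts above can be reproduced from Python itself. This is only a small sketch using the built-in `int.to_bytes` and `int.from_bytes`; the variable names are chosen for illustration, and the 4-byte width is an assumption matching the 32-bit example.

```python
value = 1025

# The same integer laid out in 4 bytes, in each byte order.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex())     # 00000401
print(little.hex())  # 01040000

# If a big-endian reader receives the little-endian bytes unchanged,
# it decodes a completely different number.
misread = int.from_bytes(little, byteorder="big")
print(misread)       # 17039360
```

The last line is the "could mean something else" case: the bytes arrive intact, but the receiver interprets them under a different convention.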

That's why, to understand each other, two machines have to agree on a convention: a protocol. For IP, for example, the convention is to use network byte order (big-endian).
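As a rough sketch of such a convention, the standard `struct` module lets both ends pack and unpack an integer in network byte order with the `!` prefix, whatever the native byte order of each machine. The helper names `encode_u32` and `decode_u32` are hypothetical, chosen only for this example.

```python
import struct


def encode_u32(number: int) -> bytes:
    """Pack an unsigned 32-bit integer in network (big-endian) byte order."""
    return struct.pack("!I", number)


def decode_u32(data: bytes) -> int:
    """Unpack an unsigned 32-bit integer sent in network byte order."""
    return struct.unpack("!I", data)[0]


wire = encode_u32(1025)      # what the sender would put on the wire
received = decode_u32(wire)  # what the receiver reconstructs

print(wire.hex())  # 00000401
print(received)    # 1025
```

Because both functions name the byte order explicitly, the result does not depend on the hardware that runs them.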

More on endianness in this question

Serialization (and Deserialization)

We can't directly send an object's underlying bit representation over the network, for the reasons described above, but those are not the only reasons.

An object can refer to another object internally through a pointer (the in-memory address of that second object). This address is, again, platform-dependent, and it is only meaningful inside the process that holds the object.

Python solves this with a serialization algorithm called pickling, which transforms an object hierarchy into a byte stream. That byte stream follows a defined format (a pickle protocol), and this shared protocol is what lets both ends understand each other.
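A minimal sketch of what "transforms an object hierarchy into a byte stream" means in practice, using only the standard `pickle` module. The dictionary and its keys are made up for illustration; the point is that the internal reference between objects survives the round trip even though no memory address does.

```python
import pickle

# Two keys refer to the very same list object (an internal reference).
shared = [1, 2, 3]
original = {"first": shared, "second": shared}

payload = pickle.dumps(original)  # object hierarchy -> bytes
restored = pickle.loads(payload)  # bytes -> new object hierarchy

print(type(payload).__name__)                   # bytes
print(restored == original)                     # True
print(restored["first"] is restored["second"])  # True: the sharing is rebuilt
print(restored["first"] is shared)              # False: these are new objects
```

Note that `pickle.loads` should only be used on data from a trusted source, since unpickling can execute arbitrary code.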

Pickle module documentation

arainone
  • Does that mean that Python serialization, e.g. the pickle process, converts the bits into another format of bits that can be understood universally by all machines? On a high level, can you shed some light on how this is implemented? – David Zheng Jan 17 '16 at 16:47
  • @David I updated my answer with a brief explanation of serialization the Python way, pickling – arainone Jan 17 '16 at 17:07
  • Just did what you said, splitting the comment into two. To understand the idea better, can I make an analogy? Machines using different bit representations are like two countries using different units of length. For example, country A uses the foot, and country B uses the passus. Both countries know how to convert between their own unit and the commonly known unit of length, the meter. Country A asks country B to manufacture something that's 1 foot long. It's not possible because no one in country B understands how long 1 foot is. – David Zheng Jan 17 '16 at 17:59
  • Continuing from the last comment. To make this happen, country A first needs to convert the length into meters (pickling). Then country B can convert that length in meters into passus (unpickling) and start manufacturing. The translation between length units is analogous to the pickling protocol. Do I understand it correctly? – David Zheng Jan 17 '16 at 18:00
  • @David your analogy would be correct for the endianness problem, with the meter representing network byte order, a unit both ends know how to convert to and from. For pickling though, such an analogy would oversimplify the problem, in particular the fact that pickling converts an object hierarchy into a byte stream. I suggest you read the [pickle documentation](https://docs.python.org/2/library/pickle.html) for a better understanding of what pickle is about – arainone Jan 17 '16 at 18:04
1

The key point for I/O is achieving interoperability, e.g. the JSON you send over a network might need to be transferred over HTTP and then parsed by JavaScript. And the data you store on disk might need to be readable the next time you run Python (different runtime environment, memory allocations, ...).
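To make the interoperability point concrete, here is a small sketch with the standard `json` module. The `record` contents are invented for the example; what matters is that the result is plain text any JSON parser can read, not a dump of Python's in-memory layout.

```python
import json

record = {"name": "example", "values": [1, 2, 3], "active": True}

text = json.dumps(record)    # Python objects -> JSON text
wire = text.encode("utf-8")  # text -> bytes suitable for a file or socket

# The receiving side (here Python again, but it could be JavaScript)
# rebuilds its own native objects from the text.
received = json.loads(wire.decode("utf-8"))

print(text)
print(received == record)  # True
```

Notice that Python's `True` is written as JSON's `true` in `text`: the format belongs to neither runtime, which is exactly what makes it portable.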

But for code execution, you usually want higher performance than would be possible with interoperable formats, e.g. by using memory addresses to access object methods, dict items, ... or by optimizing for the processor cache as much as possible.

For details on exactly how Python is implemented, you can have a look at the source of one of the interpreter implementations.

Aprillion