What are the basic differences between pickle and YAML in Python?

Question

I am new to Python. From what I have learned, both are used for serialization and deserialization. What are the basic differences between them?

Chris Johnson · Answer 1 · 2015-06-11T20:43:34.733

11

YAML is a language-neutral format that can represent primitive types (int, string, etc.) well, and is highly portable between languages. It is loosely analogous to JSON, XML or a plain-text file, with some useful formatting conventions mixed in; in fact, YAML (as of version 1.2) is a superset of JSON.

Pickle format is specific to Python and can represent a wide variety of data structures and objects, e.g. Python lists, sets and dictionaries; instances of Python classes; and combinations of these like lists of objects; objects containing dicts containing lists; etc.

So basically:

YAML represents simple data types & structures in a language-portable manner
pickle can represent complex structures, but in a non-language-portable manner
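As a minimal sketch of the second point, the snippet below pickles a nested structure that mixes sets, tuples and class instances, then restores it. The `record` contents are made up for illustration; only the standard-library `pickle`, `datetime` and `fractions` modules are used.

```python
import pickle
from datetime import date
from fractions import Fraction

# Illustrative data only: a dict mixing containers and class instances.
record = {
    "tags": {"alpha", "beta"},           # set
    "point": (1, 2),                     # tuple
    "ratio": Fraction(3, 4),             # instance of a stdlib class
    "history": [{"day": date(2013, 9, 19), "values": [1, 2, 3]}],
}

blob = pickle.dumps(record)      # Python-specific byte stream
restored = pickle.loads(blob)    # rebuilds equal objects with their original types

print(type(blob).__name__)               # bytes
print(restored == record)                # True
print(type(restored["ratio"]).__name__)  # Fraction
```

The byte stream is only meaningful to Python's `pickle` module, which is the "non-language-portable" part of the trade-off.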

There's more to it than that, but you asked for the "basic" difference.

edited Jun 11 '15 at 20:43

answered Sep 19 '13 at 18:00

Chris Johnson


Thank you. Please feel free to point me to more information about pickle and YAML. For example, apart from language portability, on what criteria should we pick one of them for data serialization? – nirprat Sep 19 '13 at 18:15
@nirprat Is serialization/deserialization speed critical? What about readability: do you need to store those serialized files in a human-readable form? – alecxe Sep 19 '13 at 18:18
The structure of YAML follows the Python concept of indenting; each level is represented by an indent, and there is no closing marker. Compare this to XML, where a block opened with a start tag such as `<item>` must be closed with the matching end tag `</item>`. YAML is somewhat easier to copy, cut and paste than XML or JSON for this reason. The simplest rule of thumb is: if you are using just primitive data types, choose YAML (or JSON) because they are human-readable, editable and portable; but if you are using non-primitive data types (e.g. Python objects), then Pickle is the usual choice. – Chris Johnson Sep 19 '13 at 18:21
I am collecting some stats through an API call, and they will be heavily used. So in my case speed is a concern, which is why I am trying to dump this data into a file and cache it instead of writing it to a DB. – nirprat Sep 19 '13 at 18:23
@nirprat If speed matters, consider using `cPickle` instead of pickle; it's much faster. (This applies to Python 2; in Python 3, `pickle` uses the C implementation automatically.) – alecxe Sep 19 '13 at 18:24
That can be a good approach. You can also consider having a separate process migrate the data from your files to database later. If that process would be mediated by Python, then the choice of YAML, Pickle etc. is probably not so important. But if you want that process to be runnable in other ways (e.g. using database native import utility) you would probably get farther with YAML, JSON or XML (or even CSV if the data structure is "flat" records). I'm not aware of any database native toolset that works with Pickle. – Chris Johnson Sep 19 '13 at 18:27
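The file-cache approach discussed in these comments could look roughly like the sketch below. The helper names `save_cache` and `load_cache`, the `stats` payload and the file name are all hypothetical; the sketch assumes Python 3 and a temporary directory for the cache file.

```python
import os
import pickle  # Python 3; on Python 2, cPickle is the faster C implementation
import tempfile

def save_cache(path, obj):
    """Write obj to path using the newest pickle protocol available."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_cache(path):
    """Read back whatever save_cache wrote."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Hypothetical stats payload standing in for the API results.
stats = {"requests": 1200, "errors": 3, "latency_ms": [12.5, 14.1, 11.9]}

cache_path = os.path.join(tempfile.mkdtemp(), "stats.pickle")
save_cache(cache_path, stats)
cached = load_cache(cache_path)
print(cached == stats)  # True
```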

score 5 · Answer 2 · edited May 23 '17 at 12:01

5

pickle is a Python-specific serialization format in which a Python object is converted into a byte stream and back:

“Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy.

The main point is that it is python specific.

On the other hand, YAML is a language-agnostic and human-readable serialization format.
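To make the contrast concrete, here is a small sketch that serializes the same plain dict both ways. It assumes the third-party PyYAML package is installed (the import is guarded so the pickle half still runs without it), and the `record` contents are invented for illustration.

```python
import pickle

try:
    import yaml  # PyYAML, third-party (assumed installed via `pip install pyyaml`)
except ImportError:
    yaml = None

# Illustrative record made only of primitive types.
record = {"name": "latency", "unit": "ms", "samples": [12, 14, 11]}

as_pickle = pickle.dumps(record)   # opaque bytes, readable by Python only
print(type(as_pickle).__name__)    # bytes

if yaml is not None:
    # Block style: indented plain text that a person can read and edit.
    as_yaml = yaml.safe_dump(record, default_flow_style=False)
    print(as_yaml)
    yaml_roundtrip = yaml.safe_load(as_yaml)
    # A simple JSON document is also accepted by the YAML parser.
    json_as_yaml = yaml.safe_load('{"name": "latency", "samples": [12, 14, 11]}')
```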

FYI, if you are choosing between these formats, think about:

serialization/deserialization speed (see the cPickle module on Python 2)
do you need to store serialized files in a human-readable form?
what are you going to serialize? If it's a python-specific complex data structure, for example, then you should go with pickle.


answered Sep 19 '13 at 18:02

alecxe


So, I have to cache some stats that other programs will use for stats manipulation, and I am not concerned about human readability. – nirprat Sep 19 '13 at 18:29
@nirprat If these serialized stats will be used by non-Python programs, then `pickle` is not the way to go; choose among the language-agnostic formats: `YAML`, `JSON`, `XML`, `CSV` etc. Take a look at the `ujson` and `simplejson` modules; they are quite fast compared to the `json` module. – alecxe Sep 19 '13 at 18:32

score 1 · Answer 3 · answered Oct 01 '21 at 14:20

If the files do not need to be read by a person, and you just need to save an object to a file and read it back later, use pickle. It is much faster and the resulting binary files are smaller.

YAML files are more readable as mentioned above, but also slower and larger in size.

I tested this for my application. I measured the time to write an object to a file and read it back, as well as the size of the file.

| Serialization/deserialization method | Average time, s | File size, kB |
|---|---|---|
| PyYAML | 1.73 | 1149.358 |
| pickle | 0.004 | 690.658 |

As you can see, the YAML file is about 1.66 times larger, and PyYAML is about 432.5 times slower.

P. S. This is for my data. In your case, it may be different. But that's enough for comparison.
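To repeat this kind of measurement on your own data, a harness along the following lines may help. It is only a sketch: the `measure` helper and the `data` object are hypothetical and not the code behind the numbers above, and it times pickle alone so that it runs with just the standard library.

```python
import os
import pickle
import tempfile
import time

def measure(dump, load, obj, mode, repeats=5):
    """Return (average seconds per write+read cycle, file size in kB).

    dump(obj, fh) and load(fh) are any file-based serializer pair;
    mode is "b" for binary formats and "" for text formats.
    """
    path = os.path.join(tempfile.mkdtemp(), "bench.dat")
    start = time.perf_counter()
    for _ in range(repeats):
        with open(path, "w" + mode) as fh:
            dump(obj, fh)
        with open(path, "r" + mode) as fh:
            loaded = load(fh)
    elapsed = (time.perf_counter() - start) / repeats
    assert loaded == obj  # sanity check: the round trip preserved the data
    return elapsed, os.path.getsize(path) / 1000.0

# Hypothetical test data; substitute the object your application really stores.
data = {"row%d" % i: list(range(20)) for i in range(1000)}

seconds, size_kb = measure(pickle.dump, pickle.load, data, "b")
print("pickle: %.4f s, %.1f kB" % (seconds, size_kb))

# With PyYAML installed (an assumption), the same harness would take
# measure(yaml.safe_dump, yaml.safe_load, data, "")
```

Passing the serializer as a pair of callables keeps the timing loop identical for every format, so the comparison is like for like.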
