
We're using PyYAML version 5.3.1 under Python 3.7.

We're finding that the order of lists is not being preserved.

For example, assume that in the file example.yaml, we have the following ...

---
data:
  - start
  - next
...

And suppose that our Python 3.7 program looks like this:

import yaml

with open('example.yaml', 'r') as f:
    input_data = f.read()
    datadict = yaml.load(input_data, Loader=yaml.FullLoader)
    data = datadict['data']
    print(f'{data}')

When we run this program with the same input data on different machines and in different environments (command line, daemon, REST call, etc.), sometimes it prints out this:

['start', 'next']

... and sometimes it prints out this:

['next', 'start']

It's almost as if YAML is initially storing the list elements in a set and then converting that to a list, because element ordering of a set is not guaranteed. Or perhaps YAML sometimes tries to sort the data that goes into a list.

And we get the same behavior with yaml.SafeLoader instead of yaml.FullLoader.

Also, if we put a `print(input_data)` statement before the `yaml.load` call, that print always shows the data in the correct order, although the order of the list returned by PyYAML still varies as described above.

Has anyone seen this behavior? And if so, what could be causing it, and how can it be corrected so that our list ordering can be maintained?

Thank you in advance.

UPDATE: Responding to the latest comments ...

I tried `assert data[0] == 'start'` as suggested, and it indeed fails whenever the list ordering is wrong.

I also tried this:

for item in data:
    print(item)

... and it prints the items in the same incorrect order whenever the f-string printout shows the incorrect order.
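One further check along the same lines is to record what kind of object `data` actually is and whether its order happens to be alphabetical. The following is only a sketch: `describe_sequence` is a hypothetical helper name, and it assumes the list items are strings (or otherwise mutually comparable).

```python
def describe_sequence(seq):
    """Summarize a loaded value so that two environments can be compared."""
    items = list(seq)
    return {
        # A plain YAML sequence is expected to load as the built-in 'list'.
        "type": type(seq).__name__,
        # False for list subclasses or any other container type.
        "is_exact_list": type(seq) is list,
        "items": items,
        # True when the order happens to be sorted (e.g. alphabetical).
        "is_sorted": items == sorted(items),
    }
```

Calling `describe_sequence(data)` in both the command-line program and the REST server and comparing the two results would show whether the type differs or only the order does.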

Regarding the question of where this code is running: it's within the following Red Hat Linux environment:

% uname -a
Linux [HOSTNAME] 3.10.0-1160.81.1.el7.x86_64 #1 SMP Thu Nov 24 12:21:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

In one case, this Python code is running from the command line, and it always works properly.

In the other case, it's running from a REST server which is resident on the same machine. In this REST-server case, the order of the list is changed to what seems to be alphabetical order.
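Since the two cases differ only in how the interpreter is started, it may help to confirm that both really load the same interpreter and the same `yaml` module. The sketch below uses only the standard library; `environment_fingerprint` is a hypothetical helper name, not part of PyYAML.

```python
import importlib.util
import sys


def environment_fingerprint(module_name="yaml"):
    """Collect facts that can differ between the CLI and the REST server."""
    # Check this before find_spec so the result reflects the current state.
    already_imported = module_name in sys.modules
    # find_spec locates a top-level module without importing it.
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "python_version": sys.version.split()[0],
        # File the module would be loaded from, or None if it is not found.
        "module_origin": spec.origin if spec is not None else None,
        "already_imported": already_imported,
        "sys_path": list(sys.path),
    }
```

Logging the returned dictionary once from the command-line program and once from inside the REST server makes any difference in `sys.path` or in the location of `yaml` visible.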

In both cases, it's Python 3.7.5, and in both cases, it's PyYAML 5.3.1. And yes, I now agree that some subsidiary package imported by the REST server's Python module is probably altering the behavior of `PyYAML`.
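One way another package could alter PyYAML is by registering its own constructor for the sequence tag on a loader class. PyYAML loaders keep their registered constructors in a class attribute named `yaml_constructors`, keyed by tag, so this can be inspected at runtime. The sketch below is an illustration only: `sequence_constructor_info` is a hypothetical helper, and it detects a replaced constructor, not other kinds of patching.

```python
SEQ_TAG = "tag:yaml.org,2002:seq"  # standard tag for plain YAML sequences


def sequence_constructor_info(loader_cls, tag=SEQ_TAG):
    """Return (module, qualified name) of the constructor registered for tag.

    loader_cls is expected to expose a 'yaml_constructors' mapping, as
    PyYAML loader classes do. Returns None if nothing is registered.
    """
    func = loader_cls.yaml_constructors.get(tag)
    if func is None:
        return None
    return (
        getattr(func, "__module__", None),
        getattr(func, "__qualname__", repr(func)),
    )
```

With a stock PyYAML, `sequence_constructor_info(yaml.SafeLoader)` should name `SafeConstructor.construct_yaml_seq` in `yaml.constructor`. A different module or function name inside the REST server would point at whatever replaced it.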

But does anyone know what Python package could cause the ordering of list elements to be altered? We're using Flask within the REST server, and at first I wondered whether that could be responsible. However, none of the other lists in our software have reordered elements when running within that same module within that same Flask-based REST server.

The large company at which I'm now working has tight controls over the available software libraries that we can use, including Python packages. We have to use PyYAML from our company's software repository. And although it's theoretically just an instance of the standard PyYAML 5.3.1 package, perhaps it has been altered in some way by our "Software Security" team. And again, it could be that some subsidiary package used by PyYAML has been altered at our installation so that, under certain conditions, it sorts list contents or temporarily holds list data in a set before converting it back to a list.
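If files cannot be moved between the company server and another machine, one option is to compare checksums instead of the files themselves. The sketch below hashes every `.py` file of an installed package with SHA-256. `package_file_hashes` is a hypothetical helper, it assumes the package is installed as an ordinary directory, and it ignores compiled extension modules.

```python
import hashlib
import importlib.util
from pathlib import Path


def package_file_hashes(package_name):
    """Map each .py file in an installed package to its SHA-256 digest."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        raise ImportError(f"cannot locate package directory for {package_name!r}")
    # Assumption: a regular package with a single directory on disk.
    root = Path(list(spec.submodule_search_locations)[0])
    hashes = {}
    for path in sorted(root.rglob("*.py")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        # Use a path relative to the package root so that results from
        # different machines can be compared line by line.
        hashes[path.relative_to(root).as_posix()] = digest
    return hashes
```

Running `package_file_hashes("yaml")` on the company server and on a machine with PyYAML 5.3.1 installed from PyPI yields two small text listings that can be compared by eye.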

Anyway, it seems that I'm simply out of luck with the company's PyYAML package, so I think that my only solution will be to get the source code for ruamel.yaml or some other YAML implementation and include a copy of that source code in the module I'm working on.

Thanks to all of you for all your help and feedback!

I'll keep this question open for a while longer, in case any new information might surface.

PS: The data that is being read via PyYAML is configuration data for a program. Another solution might be to simply abandon YAML altogether here and switch to JSON or to some configuration-management tool.
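For reference, the same configuration expressed as JSON and read with the standard-library `json` module would look roughly like the sketch below. The inline string stands in for a hypothetical `example.json` file. JSON arrays are decoded into Python lists in document order, although this would not help if the reordering comes from something other than PyYAML.

```python
import json

# Stand-in for the contents of a hypothetical example.json file.
config_text = '{"data": ["start", "next"]}'

config = json.loads(config_text)
data = config["data"]

# A JSON array becomes a Python list with its elements in document order.
assert data == ["start", "next"]
print(data)
```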

HippoMan
  • I think ruamel.yaml generally avoids a lot of these types of issues. – flakes Mar 17 '23 at 17:05
  • I have never seen this behavior, although my YAML use cases have been relatively straightforward configuration files. I also suggest trying out `ruamel.yaml`, which seems to be becoming more common and looks much more actively maintained – juanpa.arrivillaga Mar 17 '23 at 17:18
  • *That being said*, I am actually quite skeptical that PyYAML is not respecting the order. This is a basic requirement of a YAML list type, and if this were not respected, it seems like it would be something that was documented. Without a minimal reproducible example, it is really hard to say much at all about what is happening – juanpa.arrivillaga Mar 17 '23 at 17:19
  • Actually, I've never seen this before in `PyYAML`, either, which is why this behavior is a mystery to me and why I'm posting here. Our company is large with lots of software controls, and unfortunately, `ruamel.yaml` is not one of the packages that is offered in our available package library. However, I can download `ruamel.yaml` from Sourceforge and include its code within our project's code. Maybe that will fix the problem. – HippoMan Mar 17 '23 at 17:23
  • ... and the code I posted in my question already is a minimal reproducible example. The problem is that sometimes it produces the correct list ordering, and sometimes it doesn't, depending on where it is run. – HippoMan Mar 17 '23 at 17:35
  • PyYAML should not do this. But if you see this behaviour, it should not matter whether you use the FullLoader or the SafeLoader, as these share the same routines for most constructs including normal lists. I would be interested if you can reproduce this with (my) ruamel.yaml. Can you add some detail about the OS and environment (normal CPython distribution) and whether you use a virtualenv or an OS installed Python (e.g. if you are using some Linux distribution)? – Anthon Mar 18 '23 at 13:18
  • Another thing to try is see if this is not a bug in (f-string) printing. You can assert the actual value of the list with `assert data[0] == 'start'`, just to be sure. – Anthon Mar 18 '23 at 13:21
  • I can't reproduce this on macOS 13.2.1, Python 3.7.1, PyYAML 5.3.1. I ran both SafeLoader and FullLoader versions > 100 times – Anthon Mar 18 '23 at 13:37
  • Probably your PyYAML package is corrupted. Or one of the packages it uses is corrupted. Or some types or data are corrupted at runtime. – Tsyvarev Mar 19 '23 at 08:53
  • Thanks to all of you for your comments. See the **UPDATE** section in my original question for my responses to your comments. – HippoMan Mar 19 '23 at 16:16
  • Just to note, if your company is strict about which 3rd-party packages you can use, you will probably be in violation of company policy by copying the source code directly into your project, just as if you somehow managed to get `ruamel.yaml` installed to the company repository without permission and installed it from there. – chepner Mar 19 '23 at 16:19
  • You're probably correct. However, I could write my own YAML parser, and I know for a fact that this is not in violation. I guess I'd either have to do that, or else switch to JSON or possibly to some sort of configuration package. – HippoMan Mar 19 '23 at 16:50
  • 1) Could the different order be caused by [hash randomization](https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions)? Can you set `PYTHONHASHSEED` to some value and get the same order between runs? Would help for reproducibility. 2) Can you download [pyyaml](https://pypi.org/project/PyYAML/#history) from PyPI and check if it's different from the version you're using? – Nick ODell Mar 19 '23 at 18:46
  • Hashing should not be done when generating lists; only for sets and dicts. Also, the company has strict controls as to what can be downloaded from our servers or uploaded to our servers, so I can't download the company's version of PyYAML to my local machine for the purpose of comparison to the PyPI version, nor am I able to upload the PyPI version to the company's server in order to perform the same comparison. Later I might do an "eyeball" comparison by looking at the source code of each, but for the moment, I'm trying to get the sysadmins at the company to address this issue. – HippoMan Mar 20 '23 at 14:15
