Why do char strings and byte strings iterate differently?

Question

I've noticed something odd about for loops iterating over Python strings. For char strings, you get strings with a single character each.

>>> for c in 'Hello':
    print(type(c), repr(c))

<class 'str'> 'H'
<class 'str'> 'e'
<class 'str'> 'l'
<class 'str'> 'l'
<class 'str'> 'o'

For byte strings, you get integers.

>>> for c in b'Hello':
    print(type(c), repr(c))

<class 'int'> 72
<class 'int'> 101
<class 'int'> 108
<class 'int'> 108
<class 'int'> 111

Why do I care? I'd like to write a function that takes either a file or a string as input. For a text file/character string, this is easy; you just use two loops. For a string input the second loop is redundant, but it works anyway.

def process_chars(string_or_file):
    for chunk in string_or_file:
        for char in chunk:
            pass  # do something with char

You can't do the same trick with binary files and byte strings, because with a byte string the results from the first loop are not iterable.
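One possible way to get the same behaviour for both types is to slice instead of iterating, since slicing a `bytes` object yields `bytes` while indexing or iterating yields `int`. The following is only a minimal sketch, not the asker's actual workaround; the helper name `iter_units` is hypothetical, and it assumes the input is a `str`, a bytes-like object, or a file object that iterates line by line.

```python
import io


def iter_units(string_or_file):
    """Yield length-1 str or bytes objects from a string, byte string, or file.

    `iter_units` is a hypothetical helper name, not part of any library.
    """
    if isinstance(string_or_file, (str, bytes, bytearray)):
        chunks = [string_or_file]      # treat a bare string as a single chunk
    else:
        chunks = string_or_file        # file objects yield one line per iteration
    for chunk in chunks:
        for i in range(len(chunk)):
            # Slicing preserves the container type; indexing bytes would give an int.
            yield chunk[i:i + 1]


print(list(iter_units('Hi')))                  # ['H', 'i']
print(list(iter_units(b'Hi')))                 # [b'H', b'i']
print(list(iter_units(io.BytesIO(b'a\nb'))))   # [b'a', b'\n', b'b']
```

This still tests the type once up front, so it does not avoid the `isinstance` check; it only keeps the inner loop identical for text and binary input.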

Consider the case where your string contains Unicode characters, e.g. `for c in "•":` — Loocid, Aug 08 '23 at 05:16
@Loocid - I don't understand. Isn't this about the lack of symmetry between `str` and `bytes`? `bytes` has an API much like `str` (`istitle`, for instance), but when iterated you don't get `bytes` objects. — tdelaney, Aug 08 '23 at 05:19
Will you show how you would call `process_chars()` for each case you are asking about, as well as the output? (For example, just add a `print(char)` in the body of the loop to get some output.) — Code-Apprentice, Aug 08 '23 at 05:23
You may not care, but others certainly care because byte strings don't need to contain letters, like `a = b'\x00lol'`. E.g. if I'm hand crafting a PNG in code, I can do that with b-strings directly, and I can calculate the crc's by iterating over the bytes and working directly with numerical values, even for the parts that "are text". — Mike 'Pomax' Kamermans, Aug 08 '23 at 05:23
@tdelaney Right, I understand the confusion but I still think the design decision makes sense. Iterating a byte string should give you bytes (of which the closest datatype in python is int). Iterating a string should give you characters (of which the closest datatype in python is str). — Loocid, Aug 08 '23 at 05:24
str are sequences of Unicode characters, but bytes are sequences of 8-bit unsigned integers. So you have different type when iterating each. — tekrei, Aug 08 '23 at 05:24
`bytes` is defined as _an immutable sequence of integers_, so that's why it happens. But why it only plays the `str` game in parts, I don't know. — tdelaney, Aug 08 '23 at 05:25
Voting to close as opinion-based because the question seems to be "why do two different types behave differently?". If the question were "how do I make this trick work", I would answer "why do you need a _trick_ when you have `isinstance`?" — jtbandes, Aug 08 '23 at 05:27
@Loocid - but that's the problem. Iterating a `bytes` string doesn't give you a `bytes` object. — tdelaney, Aug 08 '23 at 05:27
@jtbandes - I disagree with the close. I think the question is how to process these two things the same. I don't see any solution that doesn't require testing types and doing different things. — tdelaney, Aug 08 '23 at 05:29
If we have to speculate as to what the real question is, maybe it needs editing :) — jtbandes, Aug 08 '23 at 05:31
@tdelaney And if you're iterating a list you don't get a list containing the element back, you just get the element. If anything the bytes behaviour makes more sense (in my opinion). I believe the only reason you get a str while iterating a str is because python lacks a character type. — Loocid, Aug 08 '23 at 05:31
"For a text file/character string, this is easy; you just use two loops." - did you consider just using an iterator that actually iterates the file character by character? Are you actually asking about how to solve the motivating problem, or will you be satisfied if you just understand the reasoning for the design decision? — Karl Knechtel, Aug 08 '23 at 05:32
The fact that iterating over strings gives more strings is in fact really awkward. It leads to weird special cases and bugs, because strings need to be handled differently from any other standard sequence type and most custom sequences. We live with it, because the devs decided it wasn't worthwhile to add a separate character type with the sole purpose of not being a sequence, but it's still a source of problems. — user2357112, Aug 08 '23 at 05:34
@Loocid - There is a clear symmetry between `str` and `bytes`. Things like `istitle` don't really make sense on bytes. How can bytes be title cased? So, it's a partial pattern, but broken on iteration. — tdelaney, Aug 08 '23 at 05:36
@tdelaney I think the addition of the string methods to the bytes class is more of a smell than the iterator returning an int. All the string methods just assume the byte array is ASCII. — Loocid, Aug 08 '23 at 05:43
Why not check the type and use `str` or `decode` if it's a byte string or file? — nisakova, Aug 08 '23 at 05:36
@SergeBallesta I'm not sure what you're disagreeing with, the statement you quoted is a fact. The shared methods between `str` and `bytes` like `istitle`, `isupper`, etc. all assume the byte array is ASCII. — Loocid, Aug 08 '23 at 06:26
@Loocid Unicode characters would only be a problem in Python 2, in Python 3 they work fine. That was the whole reason they made the painful switch in string types. — Mark Ransom, Aug 08 '23 at 12:44
@KarlKnechtel you caught me, it was really two questions in one. I've already developed a workaround for my test case, but I would have been overjoyed if someone had proposed something better. But understanding for its own sake is good too. — Mark Ransom, Aug 08 '23 at 12:49
@KarlKnechtel I'll try to contain my disappointment. And you should be careful about calling people out by name - because this is my question, I couldn't see the names of the closers like I usually do. — Mark Ransom, Aug 09 '23 at 02:30
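
Two points raised in the comments above can be checked directly: indexing a `bytes` object gives an `int` while slicing gives `bytes`, and the `str`-like methods on `bytes` only recognize ASCII. A short sketch using only built-ins (the variable names are illustrative):

```python
data = b'Hello'

first_by_index = data[0]             # int: 72
first_by_slice = data[0:1]           # bytes: b'H'
rebuilt = bytes([first_by_index])    # wrap the int in a list to get bytes back

text = '\xe9'                        # the character e-acute as a str
raw = text.encode('latin-1')         # b'\xe9', the same character as a single byte

print(first_by_index, first_by_slice, rebuilt)   # 72 b'H' b'H'
print(text.upper() == '\xc9')        # True: str.upper() is Unicode-aware
print(raw.upper() == raw)            # True: bytes.upper() leaves non-ASCII bytes unchanged
```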
