You are converting a bytes value to string:
data = str(data)
# This checks for a cruft and removes it if it exists.
if re.search("b'", data):
    data = data[2:-1]
which means that all line delimiters have been converted to their Python escape codes:
>>> str(b'\n')
"b'\\n'"
That is a literal b, literal quote, literal \ backslash, literal n, literal quote. You would have to split on r'(\\n|\\r)' instead, but most of all, you shouldn't turn bytes values into string representations here. Python produced the representation of the bytes value as a literal string you can paste back into your Python interpreter, which is not the same thing as the value contained in the object.
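To make the difference concrete, here is a small sketch; the sample value raw is made up for illustration. It shows that the string produced by str() contains no real newline at all, only the two characters backslash and n:

```python
# Hypothetical sample value, standing in for the bytes you received.
raw = b"one\ntwo"

# str() on a bytes object gives the repr, not the text it contains.
as_repr = str(raw)

# The repr holds a backslash followed by 'n', not a newline character.
has_real_newline = "\n" in as_repr
has_escaped_newline = "\\n" in as_repr

# Because there is no line break in it, splitting yields a single "line".
repr_lines = as_repr.splitlines()
```

Here has_real_newline is False and has_escaped_newline is True, which is why splitting on line boundaries appears not to work.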
You want to decode to string instead:
if isinstance(data, bytes):
    data = data.decode('utf8')
where I am assuming that the data is encoded with UTF-8. If this is data from a web request, the response headers quite often include the character set used to encode the data in the Content-Type header; look for the charset= parameter.
A response produced by the urllib.request module has an .info() method, and the character set can be extracted (if provided) with:
charset = response.info().get_param('charset')
where the return value is None if no character set was provided.
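A rough sketch of how that could be combined with a fallback follows. The helper decode_body and its UTF-8 default are my own assumptions, not part of urllib. To keep the example runnable without a network call, a plain email.message.Message stands in for the headers object; what response.info() returns is a subclass of that class, so get_param() behaves the same way:

```python
from email.message import Message


def decode_body(body, headers, default="utf-8"):
    """Decode bytes using the charset from the headers, if one was given.

    Hypothetical helper: falls back to `default` (an assumption) when the
    Content-Type header carries no charset parameter.
    """
    charset = headers.get_param("charset") or default
    return body.decode(charset)


# Stand-in for response.info(); the header values are made up.
with_charset = Message()
with_charset["Content-Type"] = "text/plain; charset=latin-1"

without_charset = Message()
without_charset["Content-Type"] = "text/plain"

text_latin = decode_body("caf\xe9".encode("latin-1"), with_charset)
text_default = decode_body(b"plain ascii", without_charset)
```

With a real response you would pass response.read() and response.info() instead of the stand-ins.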
You don't need to use a regular expression to split lines; the str type has a dedicated method, str.splitlines():
Return a list of the lines in the string, breaking at line boundaries. This method uses the universal newlines approach to splitting lines. Line breaks are not included in the resulting list unless keepends is given and true.
For example, 'ab c\n\nde fg\rkl\r\n'.splitlines() returns ['ab c', '', 'de fg', 'kl'], while the same call with splitlines(True) returns ['ab c\n', '\n', 'de fg\r', 'kl\r\n'].
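Putting the two steps together, a minimal sketch, assuming the data is UTF-8 (the sample bytes are made up and mix all three line-ending styles):

```python
# Hypothetical input with \r\n, \n and \r line endings.
data = b"first\r\nsecond\nthird\r"

# Decode to text first, then split on line boundaries.
if isinstance(data, bytes):
    data = data.decode("utf8")

lines = data.splitlines()
```

No regular expression and no stripping of a leading b' is needed; lines ends up as ['first', 'second', 'third'].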