Weird UTF-8 one-liner interpreter bug

Question

So, I have this unholy abomination of a program:

print((lambda raw, name_file: ((lambda start_time, total, lines, names: ((lambda parsed: ('\n'.join(str(10*(parsed[0][name]+parsed[1][name]/2)/total).ljust(6) + name for name in names)))(list(map(lambda x: __import__("collections").Counter(x), map(lambda x: list(map(lambda x: x[1], x)), [list(group[1]) for group in __import__("itertools").groupby(sorted([list(group[1])[0] for group in __import__("itertools").groupby(sorted(list(map(lambda x: [x[3], ' '.join([x[4], x[5], x[6]]), __import__("datetime").datetime.strptime(x[0] + ' ' + x[1], '%Y.%m.%d %H:%M:%S')], map(str.split, filter(lambda x: (any(name.strip() in x for name in names) and "OK ( 0 )" in x), lines))))), lambda x: (x[0], x[1]))], key = lambda x: (x[2], x[1], x[0])), lambda x: ((x[2] < start_time+__import__("datetime").timedelta(days=7)) + (x[2] < start_time+__import__("datetime").timedelta(days=14))))]))))))(__import__("datetime").datetime.strptime(raw.readline().strip(), '%d.%m.%Y %H:%M'), int(raw.readline()), map(lambda x: x.replace("Минчен", ""), raw.readlines()), list(map(str.strip, name_file.readlines())))))(raw = open("test.txt", "r"), name_file = open("names.txt", "r")))

(probably better on pastebin)

It almost works, but the way it does not work is very weird and looks like an interpreter bug to me.

The only non-ASCII characters in the code are in the string "Минчен" near the end, and they are valid UTF-8, which is supposed to be the default source encoding. The problem is that Python throws this error:

Non-UTF-8 code starting with '\xd1' in file lulz.py on line 1, but no encoding declared;

And it's not just some weird encoding problem! If I remove the last "н" in the string, the program runs just fine; the moment I add any Russian letter in its place, the interpreter crashes. Even if I only add one line break before this place, anywhere, just so that this string is on the second line of the source code, the interpreter does not crash.

Of course, I can't provide a minimal example, given how finicky and unstable this is, but I'm pretty sure this is not the expected behaviour. Is this a bug in the interpreter, or am I doing something wrong?

BTW, it may require "names.txt" and "test.txt" to be present; if you want to test, you can create two empty files with these names.

UPD Even adding a space after any single ( makes everything work! Something is definitely wrong here.

UPD2 I am using Python 3.5.1

>>> python3 --version
Python 3.5.1

UPD3 here is my file.

UPD4 and here is a hexdump: http://pastebin.com/5R1rbtc3

UPD5 apparently, this problem can only be reproduced on a Mac. I feel like different behaviour on different platforms is not intended.

Can you reproduce this with code that doesn't look like Lisp? — chepner, Oct 12 '16 at 18:03
@chepner that's the problem --- almost any change breaks the bug! I'm trying, but I'm not sure I can — Akiiino, Oct 12 '16 at 18:09
With the file downloaded from dropmefile: using Python 3.4.4, the script complains that test.txt is not found. Using Python 2, it complains about the encoding; adding a "coding" declaration makes it complain about the missing test.txt instead. — Tryph, Oct 12 '16 at 18:58
@Tryph Both test.txt and names.txt can be empty; can you please create them and try again? — Akiiino, Oct 12 '16 at 19:00
I got `ValueError: time data '' does not match format '%d.%m.%Y %H:%M'` — Tryph, Oct 12 '16 at 19:50
@Tryph so that means the code actually passes syntax check for you. Are you on Windows? — Akiiino, Oct 12 '16 at 20:16

score 1 · Answer 1 · answered Oct 12 '16 at 18:11

The bug is in your expectation of what the default source file encoding is. It is UTF-8 only when you're using Python 3.x (I checked: 3.5 parses the abomination without problems).

Python 2.x defaults to ASCII, so add an encoding comment as the first line of this abomination and you're good to go:

# -*- coding: utf8 -*-
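
A minimal sketch, assuming Python 3, of how to check which encoding the tokenizer would pick for a given source file. It uses the standard library's `tokenize.detect_encoding`; the helper name `declared_encoding` is made up for this example. Note that it reports Python 3's rule (UTF-8 when nothing is declared), not Python 2's ASCII default.

```python
import codecs
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the normalized codec name Python 3 would use for these source bytes."""
    # detect_encoding reads at most two lines, looking for a BOM or a coding comment.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    # Normalize spellings such as "utf8" to the codec's canonical name.
    return codecs.lookup(encoding).name


# No declaration at all.
print(declared_encoding(b'print("hi")\n'))
# Emacs-style declaration, as in the comment above.
print(declared_encoding(b'# -*- coding: utf8 -*-\nprint("hi")\n'))
# The shorter form is recognized too.
print(declared_encoding(b'# coding: utf8\nprint("hi")\n'))
```

To inspect a real file, read it in binary mode and pass the bytes in, for example `declared_encoding(open("lulz.py", "rb").read())`.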

Irmen de Jong

the `-*-` are not necessary – Tryph Oct 12 '16 at 18:18
And, as "python-3.x" tag suggests, I am indeed using python 3.5. Also, if I was using python 2, the program would not work at all, not just in some cases – Akiiino Oct 12 '16 at 18:34
*shrug* then I don't know what you're doing wrong, I downloaded (! not copy-pasted!) the file and ran it directly with python 3.5, no errors. It has to be your editor somehow because you say editing just a single space makes it faulty. I edited the file with sublime text 3 to insert the coding line and it worked fine in python 2.x then as well. – Irmen de Jong Oct 12 '16 at 18:47
@IrmendeJong Umm.. Are you on Linux? Because I'm using Mac; maybe there are some platform differences? – Akiiino Oct 12 '16 at 18:50
I am on windows 7. I downloaded the file from the pastebin. Will try on mac shortly – Irmen de Jong Oct 12 '16 at 18:50
Just checked --- both pastebin and dropmefiles versions give an error for me. This doesn't make sense... – Akiiino Oct 12 '16 at 18:51
Ok a bit unexpected, the file gave the error for me on mac os as well. However, adding the "coding: utf8" comment as a first line again fixed it (both python 2.7 and 3.5 no errors). Used default vi to edit the file. – Irmen de Jong Oct 12 '16 at 19:05

score 0 · Answer 2 · answered Oct 12 '16 at 18:14

Characters themselves do not have an encoding - it does not make sense to say a character is UTF-8. UTF-8 is just one of many encodings that can be used to represent a character. You do have non-ASCII characters in your program, and based on the error, the source file is being saved in an encoding other than UTF-8. Because the non-UTF-8 encoding is not declared in the source file, Python does not know what encoding to use instead of UTF-8, resulting in the error. The best solution would be to tell your editor to save the file using UTF-8, but obviously the process for doing so will be specific to your editor.
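
One way to test whether a saved file really is valid UTF-8 is to decode its raw bytes and see where decoding first fails. This is only a sketch: the helper names `first_invalid_utf8` and `check_file` are invented for illustration, and the sample bytes stand in for a real source file.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def check_file(path):
    """Apply the same check to a file on disk, read as raw bytes."""
    with open(path, "rb") as handle:
        return first_invalid_utf8(handle.read())


# The same text saved in two different encodings.
as_utf8 = 'x = "Минчен"'.encode("utf-8")
as_cp1251 = 'x = "Минчен"'.encode("cp1251")

print(first_invalid_utf8(as_utf8))
print(first_invalid_utf8(as_cp1251))
```

If `check_file("lulz.py")` returns `None`, the file is valid UTF-8 and the editor's save encoding is not the cause of the error.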

Daniel Harding

And how exactly would changing _the length of my string_ affect the encoding that the file is saved in? No, I'm sure that the file is saved in UTF-8 – Akiiino Oct 12 '16 at 18:35
BTW, using the iconv method mentioned in this answer http://stackoverflow.com/a/11021413/3713926, I am getting that the file is definitely valid UTF-8 – Akiiino Oct 12 '16 at 18:38
I answered as best I could with the information you provided. Without having the source file itself, I cannot determine if the problem is in the source file or in Python. – Daniel Harding Oct 12 '16 at 18:43
Maybe try putting a hex dump of the file that gives you an error into a pastebin? – Daniel Harding Oct 12 '16 at 18:59
I've posted the file itself; I'll add the hexdump – Akiiino Oct 12 '16 at 19:01
You are right, the file is definitely valid UTF-8. So my answer is not useful. I tested with Python 3.4.5 on Linux and didn't get the encoding error. Unfortunately I don't have access to a Mac to try to look into it any further. – Daniel Harding Oct 15 '16 at 18:46
