
I was creating a long list of strings like this:

tlds = [
  'com',
  'net',
  'org'
  'edu',
  'gov',
...
]

I missed the comma after 'org'. Python silently concatenated it with the string on the next line, producing 'orgedu'. The resulting bug was very hard to track down.

There are already many ways to define multi-line strings, some of them very explicit. So I wonder: is there a way to disable this particular behavior?

Reci
  • There might be a linter out there that will warn you about this, or you can try adding a rule to your linter to catch it. – hostingutilities.com Oct 12 '21 at 03:12
  • Yes, it’s quite a prevalent issue. I’ve also fallen victim to it once or twice myself. – rv.kvetch Oct 12 '21 at 03:12
  • This is part of the core syntax as defined [in the documentation](https://docs.python.org/3/reference/lexical_analysis.html#string-literal-concatenation). There is no way to disable this as it would change the syntax in Python. You will need to configure a linter to catch this issue. – metatoaster Oct 12 '21 at 03:13
  • Also, this isn't defining a "multi-line string"; it is simply [String concatenation without '+' operator](https://stackoverflow.com/questions/18842779/). One workaround is to actually define a multi-line string and call `.strip().splitlines()` to create the list at runtime for `tlds`. – metatoaster Oct 12 '21 at 03:17
  • Thanks for the doc. Found this PEP: https://www.python.org/dev/peps/pep-3126/. Also found https://pypi.org/project/flake8-no-implicit-concat/ for linting. – Reci Oct 12 '21 at 03:18
  • @metatoaster I totally agree. I mentioned multi-line string because I thought this feature was originally designed for creating MLS. As for `splitlines()`, since the actual list could be huge, we'd like to avoid the runtime cost. – Reci Oct 12 '21 at 04:02
  • Any particular reason you're constructing this list of strings in Python, as opposed to reading from a text file? – ddejohn Oct 12 '21 at 04:13
  • Also +1 to @metatoaster 's suggestion. Using a multi-line string and splitting on newlines is what I do when I need this. – ddejohn Oct 12 '21 at 04:15
  • Have you timed `splitlines()`? Parsing the big list statement is going to take some time too, it might be a wash. – Mark Ransom Oct 12 '21 at 04:28
  • @CrendKing I humbly disagree with your assessment of the runtime cost. I created and ran [the following benchmark](https://gist.github.com/metatoaster/146c64ef0c053029a921b0bb934469dc): the first script generates a random sample of a million strings, which are encoded either as items in a list (`rawlist.py`) or as lines within a multi-line string (`splitlist.py`). The latter, `splitlist.py`, produced the output in 1/10th the time of the list. Just use `splitlines`. (Note that I didn't use `strip()` because the newline characters are known, though it shouldn't make much performance difference.) – metatoaster Oct 12 '21 at 04:34
  • @metatoaster Thanks for the script. My test result is at https://pastebin.com/vRPg7uc5, which ran on Windows 11 with Python 3.10.0. It looks like using `import` versus just running the script gives opposite timings. – Reci Oct 12 '21 at 12:05
  • The `import` test is more realistic, given that most Python programs (anything other than a single standalone script) run from bytecode that has already been generated. You may need to run it again with the `__pycache__` directory containing the compiled bytecode. No idea why Python 3.10 on Windows would take longer to compile/run the split file (I used 3.9.5 on Linux). Intuitively, though, tokenizing and parsing a large, varied set of tokens into an AST should be more work than assigning a single string token and then splitting it and appending the pieces to a list. – metatoaster Oct 12 '21 at 23:04
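
The workaround suggested in the comments above (a multi-line string split with `.strip().splitlines()`) could look roughly like the sketch below. The name `TLDS_TEXT` is only an illustrative choice and does not come from the thread.

```python
# Hypothetical name: one entry per line, no quotes or commas to forget.
TLDS_TEXT = """
com
net
org
edu
gov
"""

# strip() drops the leading and trailing blank lines,
# splitlines() then yields one list item per remaining line.
tlds = TLDS_TEXT.strip().splitlines()
print(tlds)
```

With this layout a forgotten comma cannot happen, since there are no commas; the trade-off is that the list is built at runtime instead of being written as a list literal.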

2 Answers


The right Platonic thing to do is to modify the linter. But I think life is too short for that, and if the next coder does not know about your modified linter, their life will be a living hell.

There is no shame in making sure that the input is valid, even if it is hardcoded. If it were up to me, I would implement a manual workaround like so:
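
For a sense of what such a linter rule does, here is a minimal sketch using the standard-library `tokenize` module. It is not how any particular linter is implemented; the function name `find_implicit_concat` is made up for illustration, and it only looks at plain `STRING` tokens, so it may miss other cases (for example f-strings on newer Python versions).

```python
import io
import tokenize


def find_implicit_concat(source):
    """Return line numbers where a string literal directly follows another one."""
    # NL (a line break inside brackets) and comments may sit between the
    # two literals without changing the fact that they get concatenated.
    ignorable = {tokenize.NL, tokenize.COMMENT}
    hits = []
    prev = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in ignorable:
            continue
        if (tok.type == tokenize.STRING
                and prev is not None
                and prev.type == tokenize.STRING):
            hits.append(tok.start[0])
        prev = tok
    return hits


sample = "tlds = [\n  'com',\n  'net',\n  'org'\n  'edu',\n  'gov',\n]\n"
print(find_implicit_concat(sample))  # [5]: 'edu' on line 5 follows 'org'
```

A logical end of statement produces a `NEWLINE` token rather than `NL`, so two separate string statements on consecutive lines are not reported.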

tlds = [
  'com',
  'net',
  'org'
  'edu',
  'gov',
]

redone = ''.join(tlds)
chunk_size = 3
tlds = [ redone[i:i+chunk_size] for i in range(0, len(redone), chunk_size) ]

# Now you have a nice `tlds`
print(tlds)

You can forget commas, or write two elements on the same line or even in the same string, all you want. You can invite unwary collaborators to mess with it too: the text will be re-split into threes (chunk_size) later on anyway. This only works when every entry is exactly chunk_size characters long, so it depends on whether that is OK for your application.

EDIT: Following @Jasmijn's note, I think there is an alternative approach for entries of varying length: write the input as a literal multi-line string, like this:

tlds = ['''com
net
org
edu
gov
nl
co.uk''']

# This way every line is an entry of its own, written without any quotes or
# decorations except at the start of the first entry and the end of the last.
# tlds is a one-element list here, so split the string it contains.
tlds = tlds[0].split('\n')
Bilal Qandeel

Why don't you simply wrap the strings in str(…)? If you forget a comma, a SyntaxError will be raised.

tlds = [
  str('com'),
  str('net'),
  str('org')
  str('edu'),
  str('gov'),
  str('nl'),
  str('co.uk')
]

If all these str's are too much, you can create a temporary alias:

s = str
tlds = [
  s('com'),
  s('net'),
  s('org')
  s('edu'),
  s('gov'),
  s('nl'),
  s('co.uk')
]

(Note: This code intentionally raises a SyntaxError - to eliminate it, the missing comma needs to be added)
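
To see the difference without breaking a real module, the two variants can be fed to the built-in `compile()` as source text. This is only a small sketch; the helper name `compiles` and the shortened two-entry lists are made up for illustration.

```python
def compiles(source):
    """Return True if the source text compiles, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# Both snippets are missing the comma after the first entry.
plain = "tlds = [\n  'org'\n  'edu',\n]\n"
wrapped = "tlds = [\n  str('org')\n  str('edu'),\n]\n"

print(compiles(plain))    # True: the literals are silently concatenated
print(compiles(wrapped))  # False: two adjacent calls are a syntax error
```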

TheEagle