2

I want to create a long string from a list of objects that contain smaller strings. A simplified example is a chat log:

class Line:
    def __init__(self, user, msg):
        self.user = user
        self.msg = msg

Now I try to create a log:

log = []
for line in lines:
    log.append("{0}: {1}".format(line.user, line.msg))
log_str = "\n".join(log)

On a fast machine I get only around 50000 lines per second (according to tqdm).

Alternatives I tried are just concatenating the string:

log.append(line.user + ": " + line.msg + "\n")

or appending each line directly to log_str. Both are slower.

As far as I know, concatenation is faster with "\n".join(string_list), but how can I speed up creating the lines?

martineau
allo
  • If your current code works, the question might be better suited to https://codereview.stackexchange.com/. – jfaccioni Jan 26 '22 at 20:52
  • 7
    Is 50,000 lines per second too slow? What would be an adequate speed? Can you provide the application with more resources? Honestly, for such a basic task I think a C++ exe or (in linux especially) some existing file mangling command line tool might be a better approach. – JeffUK Jan 26 '22 at 20:54
  • 1
    @jfaccioni, this doesn't look like a review request, so recommending [codereview.se] probably isn't appropriate. – Toby Speight Jan 26 '22 at 20:59
  • @JeffUK The largest file I have to process currently has 2 million lines and is still growing. So it is acceptable but still I think that a modern computer should be able to process it much faster. – allo Jan 27 '22 at 20:13

2 Answers

9

Do you know about f-strings? They are available in Python 3.6+.

name = "example"
print(f"any string {name}") 

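f-strings are evaluated at run time and are compiled to dedicated formatting bytecode instead of a call to str.format, so they should generally be faster than both format and concatenation.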

You can read more about them here https://stackabuse.com/string-formatting-with-python-3s-f-strings/
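To check this on your own data, something like the following sketch could be used. It assumes a Line class like the one in the question; the helper names with_format and with_fstring and the sample data are made up for illustration. It only prints timings, which will differ by machine and Python version.

```python
import timeit


class Line:
    def __init__(self, user, msg):
        self.user = user
        self.msg = msg


# Hypothetical sample data standing in for the real chat log.
lines = [Line(f"user{i % 10}", f"message {i}") for i in range(10_000)]


def with_format(lines):
    # Approach from the question: one str.format call per line.
    return "\n".join(["{0}: {1}".format(line.user, line.msg) for line in lines])


def with_fstring(lines):
    # Same output, built with one f-string per line.
    return "\n".join([f"{line.user}: {line.msg}" for line in lines])


if __name__ == "__main__":
    for func in (with_format, with_fstring):
        seconds = timeit.timeit(lambda: func(lines), number=20)
        print(f"{func.__name__}: {seconds:.3f} s for 20 runs")
```

Both functions return the same string, so any difference in the printed numbers comes from the formatting step alone.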

Vik
  • 7
    I did a quick test (which may not mirror the OP's situation accurately enough) that suggests an f-string will be about 2x faster than calling `format`. – chepner Jan 26 '22 at 20:59
  • 2
    See also [this](https://stackoverflow.com/a/48465349/3282436) answer. – 0x5453 Jan 26 '22 at 21:00
5

Options for representing Line, from fastest to slowest:

  • As a tuple ("user", "message").

  • As a namedtuple:

    import collections
    Line = collections.namedtuple("Line", "user, msg")
    line = Line("myuser", "mymsg")
    
  • With __slots__ and a regular class:

    class Line:
        __slots__ = ("user", "msg")

        def __init__(self, user, msg):
            self.user = user
            self.msg = msg
    

Fastest way to create the string using the fastest line representation (tuple or namedtuple)¹:

log_str = "".join([f"{user}: {message}\n" for user, message in lines])

I don't think you'll be able to go faster than that without resorting to Cython or running on PyPy.

Keep in mind your largest bottleneck is the attribute access and not the string formatting. Attribute access in Python is slow.
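Whether attribute access really dominates in your case can be measured. The sketch below is only an illustration under assumptions: the names LineTuple, LineSlots, build_by_unpacking and build_by_attribute are hypothetical, and the sample data is synthetic. It builds the same log string from plain tuples, namedtuples and a __slots__ class and prints the timings.

```python
import collections
import timeit

# Hypothetical names; the answer calls both variants simply "Line".
LineTuple = collections.namedtuple("LineTuple", "user, msg")


class LineSlots:
    __slots__ = ("user", "msg")

    def __init__(self, user, msg):
        self.user = user
        self.msg = msg


def build_by_unpacking(lines):
    # Works for plain tuples and namedtuples: no attribute lookups.
    return "".join([f"{user}: {msg}\n" for user, msg in lines])


def build_by_attribute(lines):
    # Works for any object exposing .user and .msg.
    return "".join([f"{line.user}: {line.msg}\n" for line in lines])


# Synthetic sample data in the three representations.
raw = [(f"user{i % 10}", f"message {i}") for i in range(10_000)]
as_namedtuples = [LineTuple(*pair) for pair in raw]
as_slots = [LineSlots(*pair) for pair in raw]

if __name__ == "__main__":
    cases = [
        ("tuple, unpacking", build_by_unpacking, raw),
        ("namedtuple, unpacking", build_by_unpacking, as_namedtuples),
        ("namedtuple, attributes", build_by_attribute, as_namedtuples),
        ("__slots__ class, attributes", build_by_attribute, as_slots),
    ]
    for label, func, data in cases:
        seconds = timeit.timeit(lambda: func(data), number=20)
        print(f"{label}: {seconds:.3f} s for 20 runs")
```

If the attribute-based rows are not noticeably slower on your machine, the representation is probably not your bottleneck.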


¹ Yes, the list comprehension is required, and is faster than a generator expression.
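The footnote can be checked with a small timing sketch such as the one below. The function names join_list and join_generator and the sample data are assumptions made for illustration, and the printed numbers depend on the interpreter and machine.

```python
import timeit

# Synthetic sample data: (user, message) tuples.
lines = [(f"user{i % 10}", f"message {i}") for i in range(10_000)]


def join_list(lines):
    # join receives a ready-made list.
    return "".join([f"{user}: {msg}\n" for user, msg in lines])


def join_generator(lines):
    # join receives a generator and has to materialize it first.
    return "".join(f"{user}: {msg}\n" for user, msg in lines)


if __name__ == "__main__":
    for func in (join_list, join_generator):
        seconds = timeit.timeit(lambda: func(lines), number=50)
        print(f"{func.__name__}: {seconds:.3f} s for 50 runs")
```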

Bharel
  • You can use a generator in place of the list in `join` – mozway Jan 26 '22 at 21:26
  • @mozway nope, will be slower. `"".join()` does 2 passes over the data and internally converts it into a list (first needs to know how many bytes to allocate to the string, and then starts copying). – Bharel Jan 26 '22 at 21:30
  • Weird, I remember having used this to speed up things in the past. I'll redo some timings eventually – mozway Jan 26 '22 at 21:33
  • @mozway `py -m timeit -s "a= range(10000)" "''.join([f'{n}!\n' for n in a])"` , `py -m timeit -s "a= range(10000)" "''.join(f'{n}!\n' for n in a)"`, enjoy :-) – Bharel Jan 26 '22 at 21:35
  • I believe you, but this means the conversion to list would happen twice? I find it strange that this case is not optimized. – mozway Jan 26 '22 at 21:43
  • @mozway conversion to list happens only once, but unlike the genexp, list comprehension doesn't need a conversion at all :-) – Bharel Jan 26 '22 at 21:46
  • @mozway I too had the same reaction when I got to know about it. There's a SO answer: [List vs generator comprehension speed with join function](https://stackoverflow.com/questions/37782066/list-vs-generator-comprehension-speed-with-join-function) and [Raymond Hettinger's answer](https://stackoverflow.com/a/9061024) – Ch3steR Jan 27 '22 at 09:31
  • I actually tried first to flatten the storage. But interestingly creating 2 million objects that represent my data is much faster than serializing them into the string. And I think accessing them is then also not the bottleneck. – allo Jan 27 '22 at 20:16