
Let's say I want to read a line from a socket, using the standard socket module:

def read_line(s):
    ret = ''

    while True:
        c = s.recv(1)

        if c == '\n' or c == '':
            break
        else:
            ret += c

    return ret

What exactly happens in s.recv(1)? Will it issue a system call each time? I guess I should add some buffering, anyway:

For best match with hardware and network realities, the value of bufsize should be a relatively small power of 2, for example, 4096.

http://docs.python.org/library/socket.html#socket.socket.recv

But it doesn't seem easy to write efficient and thread-safe buffering. What if I use file.readline()?

# does this work well, is it efficiently buffered?
s.makefile().readline()
Bastien Léonard
  • "Will it issue a system call each time?" Why does this matter? – S.Lott May 04 '09 at 21:04
  • Because system calls are slow. It's better to fetch a large chunk of data (if available), then process it. Now I know that Python isn't especially fast, and maybe this doesn't really matter. But the documentation says it's better to read by large chunks anyway. – Bastien Léonard May 06 '09 at 06:46
  • Note that building a string using `+=` is a no-no since it's potentially quadratic, whereas building a list using append and then using `str.join` at the end is always linear. – Mike Graham Sep 25 '10 at 04:15
  • @MikeGraham nitpicking, I know, but growing a list will be average linear, save for when the backing array fills up and the new array is allocated and copied over. So yes, linear time, except for the occasional hiccups – Miquel Dec 03 '14 at 11:23
  • @Miquel, the operation I described will be linear. Appending an item is usually constant (and is, on average, constant). Occasionally appending an item is linear, but these cases are spread out in such a way that adding n items is linear. – Mike Graham Dec 03 '14 at 12:39
  • When reading from a socket stream, it is better, as suggested, to use `str.join` than `+=`. However, you can use a `bytearray` to make the code more readable (though a little slower than `str.join`). Both are better options than `+=`. – Chen A. Aug 21 '17 at 08:27
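
To make the two alternatives from the comments concrete, here is a minimal sketch, assuming Python 3 (where `recv` returns `bytes`, so the pieces are joined with `b"".join`). The function names `read_line_join` and `read_line_bytearray` are made up for this illustration, and `socket.socketpair()` only stands in for a real connection. Both versions still call `recv(1)` once per byte, so they fix only the string building, not the number of system calls.

```python
import socket


def read_line_join(sock):
    # Collect single bytes in a list and join once at the end (linear overall).
    parts = []
    while True:
        c = sock.recv(1)
        if c in (b"\n", b""):  # newline, or peer closed the connection
            break
        parts.append(c)
    return b"".join(parts)


def read_line_bytearray(sock):
    # Same loop, but accumulate into a mutable bytearray instead of a list.
    buf = bytearray()
    while True:
        c = sock.recv(1)
        if c in (b"\n", b""):
            break
        buf += c
    return bytes(buf)


# A connected pair of sockets stands in for a real network connection.
a, b = socket.socketpair()
a.sendall(b"hello\nworld\n")
a.close()

first = read_line_join(b)
second = read_line_bytearray(b)
b.close()
```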

3 Answers


If you are concerned with performance and control the socket completely (you are not passing it into a library for example) then try implementing your own buffering in Python -- Python string.find and string.split and such can be amazingly fast.

def linesplit(socket):
    buffer = socket.recv(4096)
    buffering = True
    while buffering:
        if "\n" in buffer:
            (line, buffer) = buffer.split("\n", 1)
            yield line + "\n"
        else:
            more = socket.recv(4096)
            if not more:
                buffering = False
            else:
                buffer += more
    if buffer:
        yield buffer

If you expect the payload to consist of lines that are not too huge, that should run pretty fast, and avoid jumping through too many layers of function calls unnecessarily. I'd be interested in knowing how this compares to file.readline() or using socket.recv(1).
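
The generator above is written for Python 2 strings. As a rough sketch of the same idea under Python 3, where `recv` returns `bytes`, something like the following should behave equivalently. The name `linesplit_bytes` and the `bufsize` parameter are assumptions added for this example, and `socket.socketpair()` is used only so the snippet can run without a server.

```python
import socket


def linesplit_bytes(sock, bufsize=4096):
    # Yield newline-terminated lines; a trailing partial line is yielded as-is.
    buffer = b""
    while True:
        # Drain every complete line already sitting in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line + b"\n"
        more = sock.recv(bufsize)
        if not more:  # empty bytes means the peer closed the connection
            break
        buffer += more
    if buffer:
        yield buffer


a, b = socket.socketpair()
a.sendall(b"first\nsecond\npartial")
a.close()

lines = list(linesplit_bytes(b))
b.close()
```

Each `recv` asks for up to `bufsize` bytes, so a stream of short lines costs far fewer system calls than reading one byte at a time.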

Mathieu Rodic
Aaron Watters
  • `buffer.split("\n", 1)` is not very fast if the buffer is large, because of the buffer part in the tuple; it is better to use `for line in buffer.split("\n"): yield line + "\n"` – MortenB Nov 18 '18 at 02:21

The recv() call is handled directly by calling the C library function.

It will block waiting for the socket to have data. In reality it will just let the recv() system call block.

file.readline() is an efficient buffered implementation. It is not threadsafe, because it presumes it's the only one reading the file. (For example by buffering upcoming input.)

If you are using the file object, every time read() is called with a positive argument, the underlying code will recv() only the amount of data requested, unless it's already buffered.

It would be buffered if:

  • you had called readline(), which reads a full buffer

  • the end of the line was before the end of the buffer

Thus leaving data in the buffer. Otherwise the buffer is generally not overfilled.
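
As a small illustration of the buffered file object described here, the sketch below wraps a socket with `makefile()` and reads lines from it. It assumes Python 3 and binary mode (`"rb"`), and uses `socket.socketpair()` purely as a stand-in for a real connection.

```python
import socket

a, b = socket.socketpair()
a.sendall(b"alpha\nbeta\n")
a.close()

# makefile("rb") returns a buffered binary reader on top of the socket.
reader = b.makefile("rb")
first = reader.readline()
second = reader.readline()
eof = reader.readline()  # empty bytes once the peer has closed

reader.close()
b.close()
```

Because the reader may hold data in its own buffer after a `readline()`, mixing it with direct `recv()` calls on the same socket is likely to skip bytes; it is safer to read through one or the other.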

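The goal of the question is not clear. If you need to see if data is available before reading, you can select() or set the socket to nonblocking mode with s.setblocking(False). Then a recv() with no waiting data raises an exception (socket.error with errno EAGAIN or EWOULDBLOCK; BlockingIOError in Python 3) rather than blocking. An empty result still means the peer closed the connection.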
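
A minimal sketch of both options, assuming Python 3 and using `socket.socketpair()` in place of a real connection. The variable names are illustrative only. Note that in Python 3 a non-blocking `recv()` with nothing to read raises `BlockingIOError`.

```python
import select
import socket

a, b = socket.socketpair()

# Nothing has been sent yet, so a zero-timeout select() reports nothing readable.
readable, _, _ = select.select([b], [], [], 0)
ready_before = bool(readable)

# In non-blocking mode, recv() raises instead of waiting for data.
b.setblocking(False)
try:
    b.recv(4096)
    would_block = False
except BlockingIOError:
    would_block = True

# Once the peer sends something, select() reports the socket as readable.
a.sendall(b"ping\n")
readable, _, _ = select.select([b], [], [], 1.0)
ready_after = bool(readable)
data = b.recv(4096) if ready_after else b""

a.close()
b.close()
```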

Are you reading one file or socket with multiple threads? I would put a single worker on reading the socket and feeding received items into a queue for handling by other threads.
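
One way to sketch that single-reader design, assuming Python 3: a dedicated thread owns the socket, reads lines through `makefile()`, and pushes them onto a `queue.Queue` that any number of worker threads could consume. The `reader` function and the `None` end-of-stream sentinel are choices made for this example, not part of any library API.

```python
import queue
import socket
import threading


def reader(sock, out_queue):
    # The only thread that touches the socket: read lines, hand them off.
    with sock.makefile("rb") as f:
        for line in f:
            out_queue.put(line)
    out_queue.put(None)  # sentinel: no more data will arrive


a, b = socket.socketpair()
lines = queue.Queue()
t = threading.Thread(target=reader, args=(b, lines))
t.start()

a.sendall(b"one\ntwo\n")
a.close()

# The consumer side (here the main thread) never reads the socket directly.
received = []
while True:
    item = lines.get()
    if item is None:
        break
    received.append(item)

t.join()
b.close()
```

Since only one thread reads the socket, the buffering inside the file object does not need to be thread-safe; the queue handles the hand-off between threads.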

I suggest consulting the source of Python's socket module and the C source that makes the system calls.

Joe Koberg
  • I don't really know why I asked about thread-safety, I don't need it in my current project. In fact I want to rewrite a Java program in Python. In Java it's easy to get buffered reading, and I was wondering if Python's socket module provides the same buffering (in fact, I wonder why someone wouldn't want buffering and directly call system calls instead). – Bastien Léonard May 06 '09 at 07:00
  • readlines() is not real-time, so it's useless for interactive TCP services like SMTP; readline seems to work though. – Jasen Sep 23 '16 at 02:52
def buffered_readlines(pull_next_chunk, buf_size=4096):
  """
  pull_next_chunk is a callable that accepts one positional argument, max_len
  (e.g. socket.recv or file().read), and returns a string of up to max_len
  characters, or an empty string when nothing is left to read.

  >>> for line in buffered_readlines(socket.recv, 16384):
  ...   print line
  ...
  >>> # the following code won't read the whole file into memory
  ... # before splitting it into lines, as the .readlines method
  ... # of file does. Also, it won't block until the FIFO file is closed
  ...
  >>> for line in buffered_readlines(open('huge_file').read):
  ...   # process it on a per-line basis
  ...
  >>>
  """
  chunks = []
  while True:
    chunk = pull_next_chunk(buf_size)
    if not chunk:
      if chunks:
        yield ''.join(chunks)
      break
    if '\n' not in chunk:
      chunks.append(chunk)
      continue
    chunk = chunk.split('\n')
    if chunks:
      yield ''.join(chunks + [chunk[0]])
    else:
      yield chunk[0]
    for line in chunk[1:-1]:
      yield line
    if chunk[-1]:
      chunks = [chunk[-1]]
    else:
      chunks = []
alex