
After a lot of investigation, I found that after serving hundreds of thousands of HTTP POST requests there is a memory leak. The strange part is that the leak only occurs when using PyPy.

Here's the server code:

from twisted.internet import reactor
import tornado.ioloop

# Toggle between serving with Tornado's own IOLoop and with Cyclone on the Twisted reactor.
do_tornado = False
port = 8888

if do_tornado:
    from tornado.web import RequestHandler, Application
else:
    from cyclone.web import RequestHandler, Application

class MainHandler(RequestHandler):
    def get(self):
        self.write("Hello, world")

    def post(self):
        self.write("Hello, world")

if __name__ == "__main__":
    routes = [(r"/", MainHandler)]
    application = Application(routes)

    print port
    if do_tornado:
        application.listen(port)
        tornado.ioloop.IOLoop.instance().start()
    else:
        reactor.listenTCP(port, application)
        reactor.run()

Here is the test code I am using to generate requests:

from twisted.internet import reactor, defer
from twisted.internet.task import LoopingCall

from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.iweb import IBodyProducer

from zope.interface import implements

# Persistent connection pool: reuse connections, no automatic retries, up to 10 per host.
pool = HTTPConnectionPool(reactor, persistent=True)
pool.retryAutomatically = False
pool.maxPersistentPerHost = 10
agent = Agent(reactor, pool=pool)

bid_url = 'http://localhost:8888'

# Standard IBodyProducer implementation for a fixed, in-memory request body.
class StringProducer(object):
    implements(IBodyProducer)

    def __init__(self, body):
        self.body = body
        self.length = len(body)

    def startProducing(self, consumer):
        consumer.write(self.body)
        return defer.succeed(None)

    def pauseProducing(self):
        pass

    def stopProducing(self):
        pass


def callback(a):
    pass

def error_callback(error):
    pass

def loop():
    d = agent.request('POST', bid_url, None, StringProducer("Hello, world"))
    #d = agent.request('GET', bid_url)
    d.addCallback(callback).addErrback(error_callback)


def main():
    exchange = LoopingCall(loop)
    exchange.start(0.02)

    #log.startLogging(sys.stdout)
    reactor.run()

main()

Note that this code does not leak with CPython, nor with Tornado on PyPy! It leaks only when Twisted (Cyclone) and PyPy are used together, and ONLY with POST requests.

To see the leak, you have to send hundreds of thousands of requests (at the 0.02-second interval above that's roughly 50 requests per second, so an hour or more of sustained load).
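
To make the growth measurable, here's a small helper (just a sketch, not part of the original server) that can run inside the Cyclone/Twisted version of the server; it logs the process's peak RSS every few seconds using the standard resource module (ru_maxrss is reported in kilobytes on Linux and in bytes on OS X):

from twisted.internet.task import LoopingCall
import resource

def report_memory():
    # Peak resident set size so far (kilobytes on Linux, bytes on OS X).
    print "peak RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Start this just before reactor.run().
LoopingCall(report_memory).start(5)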

Note that when PYPY_GC_MAX is set to cap the heap, the process eventually hits that cap and crashes rather than just growing.

What's going on?

Ron Reiter
  • What did you observe that makes you think there's a memory leak? Are you sure the size of the program on PyPy isn't just larger than the size on CPython? – Jean-Paul Calderone Jan 11 '14 at 17:18
  • It's hard to be 100% certain, but you can go ahead and double-check my statements. The tests were done over a very long period of time. For example, answering hundreds of thousands GET request with Cyclone maxes out the memory used by the process to 63 megabytes, whereas using the exact same test with a POST request caused the process to use 400 megabytes and keep taking more and more memory. We are running production servers that handle thousands of requests per second with Cyclone code that run for days with no memory issues using Python 2.7. This problem only happens with PyPy. – Ron Reiter Jan 12 '14 at 03:20
  • Using the packages in Debian unstable, the behavior I get is that PyPy starts up using 120MB, goes to 125MB after the first 1000 requests, and stays at that memory usage level for the next 5000 requests. I didn't let it run into the hundreds of thousands of requests since memory usage appeared to reach a stable level after only a thousand requests. Perhaps the version of PyPy or Cyclone (or another dependency) packages in Debian unstable right now is newer or older than the versions you tested with, and a problem has been fixed or introduced? – Jean-Paul Calderone Jan 12 '14 at 15:28
  • Oh. I also had to fix the client program to call `Response.deliverBody` in `callback` otherwise after the first two requests it stops. – Jean-Paul Calderone Jan 12 '14 at 15:28
  • I didn't notice any leaks. Can you please describe in more detail how does it actually leak? It's normal for PyPy to consume more memory over time when it compiles JIT code (it stops after a while). Btw it really belongs as a bug report, not as a question on stackoverflow. – fijal Jan 13 '14 at 09:23
  • it doesn't really make sense that this is a memory leak - that's why it's a Stackoverflow question. You have to send much more than 5000 requests - try 1 million to see the leak. Eventually, the process will take 100% of the memory of the machine it's running on, and crash if PYPY_GC_MAX is set. – Ron Reiter Jan 16 '14 at 19:06
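
As Jean-Paul Calderone points out in the comments, the test client above never consumes the response bodies, so it stops making progress after the first couple of requests. A minimal fix (just a sketch, assuming twisted.web.client.readBody is available, i.e. Twisted 13.1+) is to replace the no-op callback with one that drains each response:

from twisted.web.client import readBody

def callback(response):
    # Drain the body (readBody calls deliverBody internally) so the
    # persistent connection can be reused for the next request.
    d = readBody(response)
    d.addErrback(error_callback)
    return d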

1 Answer


It turns out that the cause of the leak is PyPy's implementation of io.BytesIO.

Here's how to reproduce the leak on PyPy:

from io import BytesIO

while True:
    a = BytesIO()

Here's the fix: https://bitbucket.org/pypy/pypy/commits/40fa4f3a0740e3aac77862fe8a853259c07cb00b
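
To watch the growth instead of looping forever, here's a bounded variant of the same repro (again just a sketch, using the standard resource module; ru_maxrss is kilobytes on Linux, bytes on OS X):

from io import BytesIO
import resource

for batch in xrange(100):
    for _ in xrange(100000):
        a = BytesIO()
    # Print peak RSS after each batch of allocations.
    print batch, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss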

Ron Reiter