
I am attempting to download ~55MB of JSON data from a web page with PhantomJS and Python on Windows 10.

The PhantomJS process dies with "Memory exhausted" upon reaching 1GB of memory usage.

The load is made by entering a username and password and then using

myData = driver.page_source

on a page that just contains a simple header and the 55MB of text that makes up the JSON data.

It dies even if I don't ask PhantomJS to do anything else with the data; just getting the source is enough.
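
For reference, the relevant part of the script is roughly along these lines (the URLs and element names here are placeholders, not the real ones):

    # Rough sketch of the current approach (URLs and element names are placeholders).
    from selenium import webdriver

    driver = webdriver.PhantomJS()  # headless, no browser window

    # Log in via the login form.
    driver.get("https://example.com/login")
    driver.find_element_by_name("username").send_keys("my_user")
    driver.find_element_by_name("password").send_keys("my_pass")
    driver.find_element_by_name("login").click()

    # Open the page holding the JSON payload and grab its source.
    driver.get("https://example.com/data")
    myData = driver.page_source  # this is where the memory blows up

    driver.quit()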

If I load the page in Chrome it takes about a minute to load, and the browser reports it as having loaded 54MB, exactly as expected.

The PhantomJS process takes about the same amount of time to reach 1GB of RAM usage and die.

This used to work perfectly until recently, when the data to be downloaded grew beyond about 50MB.

Is there a way to stream the data directly to a file from PhantomJS, or some setting that stops it exploding to 20x the necessary RAM usage? (The computer has 16GB of RAM; the 1GB limit is apparently a known problem in PhantomJS that they won't fix.)

Is there an alternative, equally simple, way of entering a username and password and grabbing some data that doesn't have this flaw? (And that does not pop up a browser window while working.)
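
To illustrate what I mean by streaming: if the login turns out to be a plain form POST that doesn't need the site's JavaScript, something along these lines would be ideal (the URL and form field names below are guesses, not the real ones):

    # Sketch of the kind of streaming download I'm hoping for (URLs and field names are guesses).
    import requests

    session = requests.Session()

    # Log in by posting the form fields directly (assumes a plain form POST, no JavaScript needed).
    session.post("https://example.com/login",
                 data={"username": "my_user", "password": "my_pass"})

    # Stream the JSON to disk in chunks instead of holding it all in memory.
    response = session.get("https://example.com/data", stream=True)
    response.raise_for_status()
    with open("data.json", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)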

user2711915
  • If you don't have to use Python, you could try [Nightmare.js](https://github.com/segmentio/nightmare), which is based on Electron (itself built on Chromium) and is much more modern. – Vaviloff Jun 13 '17 at 14:07
  • OK. Not enormously keen to go down the rabbit hole of using Node if I can avoid it. I was, perhaps optimistically, hoping there was some way to make the existing code not use 20x the required memory and kill itself... – user2711915 Jun 13 '17 at 16:05
  • Could this be a solution? https://stackoverflow.com/a/28628514/2715393 – Vaviloff Jun 13 '17 at 16:54
  • Thanks Vaviloff; unfortunately I need to go in via a login page first, and only then can I get the data. I can't work out what useful JavaScript the login button actually calls, despite stepping through about 1,000 steps of the ~20,000 lines of JS it apparently works with. – user2711915 Jun 13 '17 at 17:35
  • I moved to using twill, which, although it required using Python 2, does the job perfectly and is even nicer to use than Selenium (a rough sketch of the approach is below). Alternatives I tried included Scrapy, but that seems to require dozens of things to be set up for even the simplest hello world, so it was quicker to implement another solution than to read the first 10 pages of the tutorials. – user2711915 Jun 15 '17 at 13:03
  • Nice alternative, didn't know of this one! Fortunately you didn't need JavaScript execution, which is the main reason for using PhantomJS. – Vaviloff Jun 16 '17 at 03:43
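
The twill approach mentioned in the comments, in rough outline (Python 2; the URLs, form number and field names are placeholders, not the real ones):

    # Rough sketch of the twill-based replacement (URLs, form and field names are placeholders).
    from twill.commands import go, fv, submit, save_html

    # Log in through the login form.
    go("https://example.com/login")
    fv("1", "username", "my_user")   # "1" = first form on the page
    fv("1", "password", "my_pass")
    submit()

    # Fetch the data page and write its contents straight to disk.
    go("https://example.com/data")
    save_html("data.json")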

0 Answers