1

I want a script that will scrape a certain web page every hour, and will look for a certain string inside that page.

However, when I enter that page and use `view:source", I cannot see that string in the source. I was told that it's because the string I'm looking for comes from an element that is rendered on the client side (javascript), and thus I can see it only when I manually inspect that element with Chrome console for example.

Which practice / programming language / environment, would be the most efficient to achieve what I want, considering that I want to run that script from my webhost server, which has 2.25GB RAM?

Someone suggested that I will use Pyqt4, but my web-host warned me that this will kill my RAM and hurt server performance. I should note that the script supposed to be very simple, and scrape only a single page, once in an hour.

rockyraw
  • 1,125
  • 2
  • 15
  • 36
  • it can't be easy for you but you can analyze javascript code and do the same in python. if javascript use ajax to get data from server then python script can scrape this data directly form server too. – furas Dec 27 '15 at 11:08

2 Answers2

1

It seems that problem could be solved with PhantomJS, as it mocks real browser's action, which extracts information from client code.

For PhantomJS with Javascript, you may check testing-javascript-with-phantomjs

For how to use PhantomJS with python, please take a look at this

Hope it helps~

Community
  • 1
  • 1
linpingta
  • 2,324
  • 2
  • 18
  • 36
  • That was another suggestion I was offered, when I told my webhost to install it, they told me that they started a building process in order to install, and it crashed (`g++: Internal error: Killed (program cc1plus)`), since the system was out of memory, so I guess I need another solution... – rockyraw Dec 27 '15 at 11:51
  • you mean phantomjs installation caused system out of memory? – linpingta Dec 27 '15 at 12:04
  • I see, so phtonmjs is not useful~ so maybe you could take a look at https://developers.google.com/webmasters/ajax-crawling/docs/learn-more, maybe it helps~ – linpingta Dec 27 '15 at 12:26
0

I cannot see that string in the source

If you only need to fetch one string of the page you might program to do the same what js performs. If JS sends ajax request (GET or POST), you also do it using pure Python thus fetching the missing string.

Suppose in-page script performs the following (NB. code might be in pure JS see here an example):

$.ajax({
  url: "test.html",
  context: document.body
}).done(function() {
  $( this ).addClass( "done" );
});

so in your Python scripting you request the 'test.html' file:

import requests 
base='http://example.com/'
r = requests.get( base + 'test.html')

thus getting the data desired:

print r.headers['content-type']
// 'application/json; charset=utf8'
print r.text
// u'{"data":"<string>"...'
Igor Savinkin
  • 5,669
  • 8
  • 37
  • 69