2

I need to scrape news announcements from this website, Link. The announcements appear to be generated dynamically; they don't show up in the page source. I usually use mechanize, but I assume it wouldn't work here. What can I do? I'm OK with Python or Perl.

Aks
  • 5,188
  • 12
  • 60
  • 101
  • In Perl, `WWW::Mechanize::Firefox` might work (if the page works in Firefox). – choroba Nov 30 '11 at 09:48
  • You can refer to the last answer in this thread: http://stackoverflow.com/questions/8323728/scraping-dynamic-content-in-a-website – ichbinblau Dec 09 '16 at 06:29

4 Answers

11

If the content is generated dynamically, you can use Windmill or Selenium to drive a real browser and grab the data once the page has been rendered.

You can find an example here.
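
As a rough illustration (not the linked example), here is a minimal Selenium sketch in Python, assuming the Firefox driver is available; the CSS selector is a placeholder, since the real announcement markup isn't shown here:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.marketvectorsindices.com/#!News/List")

# Wait until the JavaScript has actually rendered the announcements.
# ".news-item" is a placeholder selector -- inspect the page to find the real one.
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".news-item"))
)

for item in items:
    print(item.text)  # text of the rendered element, dynamic content included

driver.quit()

Windmill follows the same idea: drive a real browser, let it execute the JavaScript, then read the rendered DOM instead of the raw source.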

jcollado
  • 39,419
  • 8
  • 102
  • 133
4

The polite option would be to ask the owners of the site if they have an API which allows you access to their news stories.

The less polite option would be to trace the HTTP transactions that take place while the page is loading and work out which one is the AJAX call which pulls in the data.

Looks like it's this one. But it appears to contain session data, so I don't know how long it will keep working.
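
As a rough sketch of that approach: once the network trace shows which request returns the news data, you can replay it directly. The endpoint, the _t token, and the assumption that it returns JSON are all hypothetical stand-ins for whatever the trace actually shows:

import json
import urllib2

# Hypothetical endpoint and parameters -- substitute whatever the HTTP
# trace shows the page requesting. The _t value may be session-bound,
# so don't expect a hard-coded copy of it to keep working.
ajax_url = "http://www.example.com/ajax/news?_t=SOME_TOKEN"

request = urllib2.Request(ajax_url, headers={
    "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints check for this
})
body = urllib2.urlopen(request).read()

# If the response is JSON this is all the parsing needed; adjust for
# XML or an HTML fragment as appropriate.
stories = json.loads(body)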

Dave Cross
  • 68,119
  • 3
  • 51
  • 97
  • How did you identify the file? – Aks Nov 30 '11 at 10:28
  • Probably firebug or the new web-console in firefox 8.. or similar. – Øyvind Skaar Nov 30 '11 at 10:34
  • 3
    I'm old-skool. I used the HTTP Live Headers extension for Firefox. – Dave Cross Nov 30 '11 at 10:37
  • @ØyvindSkaar I have firebug, could you tell me how to look for it? – Aks Nov 30 '11 at 10:39
  • @Aks Don't have it installed, but there's a "Net" tab or something that can show you all the GET (and POST, etc.) requests the browser makes when loading the page. This includes the ones made by the JavaScript code; those probably appear far down. As you use the site, you can see what AJAX calls (just GETs, POSTs, etc.) are made. See http://getfirebug.com/network – Øyvind Skaar Nov 30 '11 at 10:46
  • ok. thanks. btw, will the _t parameter change or can I use it as a link? – Aks Nov 30 '11 at 11:20
  • No-one can possibly know what the _t parameter is. As I implied above, it might be some kind of session information. I suppose you might get a clue if you examined the Javascript that makes the request. Or, the best solution is probably to ask the site owners to explain the parameters to their AJAX API. You *really* shouldn't just use it without talking to them first. – Dave Cross Nov 30 '11 at 11:26
0

There's also WWW::Scripter ("For scripting web sites that have scripts"). I've never used it.

Øyvind Skaar
  • 2,278
  • 15
  • 15
-8

In Python you can use urllib and urllib2 to connect to a website and collect data. For example:

from urllib2 import urlopen

myUrl = "http://www.marketvectorsindices.com/#!News/List"
inStream = urlopen(myUrl)
data = inStream.read(1024)  # etc., in a while loop to read the whole response
# all your fun page parsing code (perhaps: from xml.dom.minidom import parse)
Adam Morris
  • 8,265
  • 12
  • 45
  • 68