
I'm trying to scrape a website and ingest some of its content. However, the content I need is actually loaded via Ajax, and those endpoints are locked to their domain (I get 401 errors).

Is there an elegant solution for scraping websites that also lets them run their JS first? Some kind of small browser wrapper I could call on a cron once per day to get new content?

Appreciate any direction on this :)
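For context, here's roughly the kind of wrapper I have in mind: a minimal sketch that shells out to headless Chromium (its real `--headless`/`--dump-dom` flags render the page, JS included, and print the final DOM), which a cron job could invoke daily. The URL and binary name are placeholders for my setup.

```python
# Sketch: render a JS-heavy page with headless Chromium and capture the DOM.
# Example crontab entry (once per day at 03:00):
#   0 3 * * * /usr/bin/python3 /opt/scraper/fetch.py >> /var/log/scraper.log 2>&1
import subprocess


def build_render_command(url, time_budget_ms=30000):
    """Build the headless-Chromium invocation that dumps the JS-rendered DOM."""
    return [
        "chromium",                              # or "google-chrome", etc.
        "--headless",
        "--disable-gpu",
        f"--virtual-time-budget={time_budget_ms}",  # let pending JS/ajax settle
        "--dump-dom",                            # print the final DOM to stdout
        url,
    ]


def render_page(url):
    """Run the browser and return the rendered HTML as a string."""
    result = subprocess.run(
        build_render_command(url),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `render_page("https://example.com/")` would return the post-JS HTML, which I could then parse for the content I need.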

Josh Undefined
  • Calling an API is faster than crawling HTML. There is no such thing as a "domain lock"; you most likely forgot to set a cookie. If you can't make it work, there are many tools that emulate or drive a web browser (CasperJS, PhantomJS, ZombieJS, Selenium, ...) that you can use to scrape Ajax websites without reverse engineering their API – Eloims Sep 04 '15 at 10:21
  • From what I can tell, there's an initial Ajax call which verifies the origin of the requests; if they're coming from their site, it responds with a token which then authorizes the rest of the calls. I guess you're suggesting there's no such thing as a "domain lock", in that I could in theory masquerade as their domain via headers? – Josh Undefined Sep 04 '15 at 10:24
  • Yes that's what I meant. You can always fake your spider's user-agent, origin header etc. Check what is legal in your country – Eloims Sep 04 '15 at 10:27
  • I'll give it a go :p – Josh Undefined Sep 04 '15 at 10:30
  • Managed to fake all the headers it was checking, and including a valid session ID seems to allow it through. I'm guessing that session will expire at some point, so I'll look at a way to automate requesting a new one. – Josh Undefined Sep 04 '15 at 11:33
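The header-faking approach discussed above can be sketched with the standard library alone. This is a minimal illustration, not the site's actual API: the token endpoint path, cookie name, and header values are all placeholders, since the real names depend on what the site checks.

```python
# Sketch: mimic the site's own in-page ajax headers so the token endpoint
# accepts the request. All URLs/header values are hypothetical placeholders.
import urllib.request

SITE = "https://example.com"  # placeholder for the target site's origin


def build_spoofed_request(url, session_id=None):
    """Build a request whose headers look like the site's own ajax calls."""
    req = urllib.request.Request(url)
    req.add_header("Origin", SITE)
    req.add_header("Referer", SITE + "/")
    req.add_header("X-Requested-With", "XMLHttpRequest")  # common ajax marker
    req.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")
    if session_id:
        req.add_header("Cookie", f"sessionid={session_id}")
    return req


def fetch_token(session_id):
    """Hit the (hypothetical) token endpoint; the token then authorizes
    subsequent calls until the session expires and a fresh one is needed."""
    req = build_spoofed_request(SITE + "/api/token", session_id=session_id)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```

Re-running `fetch_token` whenever a call starts returning 401 again would cover the session-expiry case mentioned in the last comment.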

0 Answers