Groovy: CyberNeko | User Agents | Browser Version

Question

I'm currently using CyberNeko to grab some information from a website. However, I believe the website checks the user agent/browser version to stop clients from simply fetching the URL content.

I know that htmlunit lets me change the browser version, but I'm not sure whether I can do the same with CyberNeko.

Does anyone know if it's possible to do such a thing?

Think about this for a moment: If the owner of the site doesn't want people to scrape the page, maybe you shouldn't try to be immoral and circumvent that? I'm sure that if you contact the site owner, he may be more than willing to provide you the data in some other format that doesn't put that much load on the site as scrapers usually do or maybe there's even an API readily available for 3rd parties to use. — Esko, Nov 24 '10 at 09:05
The amount of content I'm looking to grab is tiny. I'd just prefer not to spend an hour on a task that could be automated and done much faster. — StartingGroovy, Nov 30 '10 at 23:22

score 1 · Accepted Answer · answered Nov 24 '10 at 08:38

I've never used CyberNeko, but I thought it was just an HTML parser, i.e. I didn't think you could use it to issue the HTTP request and actually download the web page.

It could be that the HTTP request issued on CyberNeko's behalf is missing various headers, such as the User-Agent header. An easy way to ensure that the HTTP request looks like one sent from a browser is to use HttpClient instead of CyberNeko to download the web page. There's some example code available here.
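As a rough illustration of the download step, here is a minimal sketch that sets a browser-like User-Agent header before fetching a page. It deliberately uses the JDK's own `HttpURLConnection` rather than HttpClient so that it has no dependencies, and it is plain Java that can be called from Groovy. The class name `BrowserLikeRequest`, the method names, the User-Agent string, and the UTF-8 assumption are all made up for this example; adjust them for the site you are dealing with.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class BrowserLikeRequest {

    // Assumption: an arbitrary browser-style User-Agent string chosen for illustration.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0";

    // Prepares a connection with browser-like headers. No network traffic
    // happens here; the request is only sent once the response is read.
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Accept", "text/html");
        return conn;
    }

    // Downloads the page body as a string.
    // Assumption: the page is UTF-8 encoded; a real client should check Content-Type.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = open(url);
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return html.toString();
    }
}
```

Calling `BrowserLikeRequest.fetch(url)` returns the HTML as a string, which can then be handed to whatever parser you prefer.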

Once you've successfully downloaded the page, use CyberNeko to parse out the bits you're interested in.

Yeah, CyberNeko looks like just a parser. I was testing out HttpClient to do the http request which seemed to work fine. I wanted to parse with CyberNeko. Looks like I'll have to break it into two parts instead of one. Thanks Don. — StartingGroovy, Nov 24 '10 at 21:00
