
I'd like to do this programmatically:

Given a page URL, I need to get all links on the page. For each link, at least three pieces of information must be obtained: the anchor text, the href attribute value, and the absolute position of the link on the rendered page.

The Java CSSBox library is an option, but it's not fully implemented yet (the href attribute value cannot be obtained at the same time, so some extra mapping must be done with an additional library such as Jsoup). What's more, CSSBox renders pages really slowly.

It seems that JavaScript has all the necessary functions available, but the JavaScript code has to be injected into the page, and a driver is needed to take advantage of an existing browser (one such approach is sketched below). Scripting languages such as Python and Ruby have support for this as well. It's hard for me to work out which tool is the most convenient.
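
For reference, here is a minimal sketch of that browser-driver approach in Python with Selenium (assuming the selenium package is installed and a matching ChromeDriver is on the PATH; the URL is a placeholder). The WebElement API exposes all three pieces of information directly:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com")    # placeholder page URL
        for a in driver.find_elements(By.TAG_NAME, "a"):
            text = a.text                    # anchor text
            href = a.get_attribute("href")   # href, resolved to an absolute URL
            pos = a.location                 # {'x': ..., 'y': ...} relative to the page's top-left corner
            print(text, href, pos)
    finally:
        driver.quit()

Because the browser actually lays the page out, element.location reports the rendered position, which is what plain HTML parsers cannot provide.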

Terry Li
  • Why can't it be a solution like this? http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – André Ricardo Oct 18 '12 at 10:16
  • @AndréRicardo Thanks, but how can I get the absolute position of the link? – Terry Li Oct 18 '12 at 10:20
  • Then maybe this is what you are looking for: a way to join the base_url and the relative_url (sketched below) http://stackoverflow.com/questions/6499603/python-scrapy-convert-relative-paths-to-absolute-paths – André Ricardo Oct 27 '12 at 15:48
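
For completeness, a sketch of the parse-only route suggested in the comments above (BeautifulSoup plus the standard library's urljoin; the URL is a placeholder). It yields anchor text and absolute URLs, but since nothing is rendered it cannot report on-page pixel positions:

    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    page_url = "https://example.com"   # placeholder page URL
    soup = BeautifulSoup(urlopen(page_url), "html.parser")
    for a in soup.find_all("a", href=True):
        # urljoin resolves relative hrefs against the page URL
        print(a.get_text(strip=True), urljoin(page_url, a["href"]))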

1 Answer


Does PHP's DOM manipulation library help you? http://www.php.net/manual/en/book.dom.php

g13n
  • If it won't render the page, I don't think it works. I need the absolute position of the link element on the page as well. – Terry Li Oct 18 '12 at 03:30
  • @TerryLi Sorry, I missed that part. Maybe you could try using http://phantomjs.org/ (see the sketch below) – g13n Oct 18 '12 at 03:32
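
Following up on the PhantomJS suggestion: the same render-then-query idea can be sketched in Python by driving a headless browser with Selenium and injecting a small script into the page, much as the question proposes. This is a sketch under assumptions (headless Chrome stands in for PhantomJS, whose development has since been suspended; the URL is a placeholder):

    from selenium import webdriver

    opts = webdriver.ChromeOptions()
    opts.add_argument("--headless")        # render the page without opening a window
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get("https://example.com")  # placeholder page URL
        # Inject JavaScript that collects text, href, and page coordinates per link.
        links = driver.execute_script("""
            return Array.from(document.links).map(function (a) {
                var r = a.getBoundingClientRect();
                return {text: a.textContent.trim(), href: a.href,
                        x: r.left + window.pageXOffset,
                        y: r.top + window.pageYOffset};
            });
        """)
        for link in links:
            print(link)
    finally:
        driver.quit()

getBoundingClientRect gives viewport-relative coordinates, so the scroll offsets are added to obtain positions relative to the top of the document.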