Iterate through all filenames/urls in a webpage in Java

Asked Jul 04 '12 at 19:54

Active Jul 04 '12 at 19:54

Viewed 383 times

I'm trying to crawl a webpage in Java, and I need to search the page for URLs and file paths, which could be relative or absolute (e.g. ../../file.gif or http://hostname.com/file.gif). Not all of these will have HTML tags around them, like <a href>, since some of the file paths may be embedded in JavaScript.

If anyone can point me in the right direction that would be fantastic.

asked Jul 04 '12 at 19:54

Brad



If you've read any hits from Google for this, you'll know not to use regex for this, but instead to use an HTML parser such as JSoup. Trying to parse HTML with regex is like trying to drink soup with a fork. Just don't do it. – Hovercraft Full Of Eels Jul 04 '12 at 19:56

This might provide some insight: http://stackoverflow.com/questions/677038/how-to-use-regular-expressions-to-parse-html-in-java – tjg184 Jul 04 '12 at 19:56
If the tags are unreliable you might be able to treat the page as a text document and use this: http://stackoverflow.com/a/1806161/1343161 – Keppil Jul 04 '12 at 20:24
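
Following the suggestion above to treat the page as a plain text document, here is a minimal sketch of what that scan could look like using only java.util.regex. It is not taken from any of the linked answers. The class name LinkScanner, the method extractCandidates, and the list of file extensions are all hypothetical choices made for illustration. The pattern looks for http/https URLs, plus relative or root-relative paths that end in one of the assumed extensions, so it will miss anything outside that list and may pick up trailing punctuation on absolute URLs. For links that do sit in proper tags, an HTML parser is still the safer route, as the first comment says.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post: scans raw page text
// (HTML and inline JavaScript alike) for URL and file-path candidates.
final class LinkScanner {

    // Assumption: only these file extensions are of interest; extend as needed.
    private static final String EXTENSIONS = "gif|png|jpe?g|css|js|html?|pdf";

    private static final Pattern CANDIDATE = Pattern.compile(
        // Absolute http/https URL, up to whitespace, a quote, bracket or paren.
        "https?://[^\\s\"'<>()]+"
        // Or a path such as ../../file.gif, /img/a.png or images/logo.png.
        + "|(?:\\.\\./|\\./|/)?(?:[\\w.-]+/)*[\\w-]+\\.(?:" + EXTENSIONS + ")\\b");

    // Returns each distinct candidate in the order it first appears.
    static List<String> extractCandidates(String pageText) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = CANDIDATE.matcher(pageText);
        while (m.find()) {
            found.add(m.group());
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        String sample = "<a href=\"http://hostname.com/file.gif\">x</a>"
            + "<script>var img = '../../file.gif';</script>";
        for (String candidate : extractCandidates(sample)) {
            System.out.println(candidate);
        }
    }
}
```

Relative results would still need to be resolved against the page's own address before they can be fetched; java.net.URI.resolve is one standard way to do that.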


0 Answers