
Using regular expressions, I want to extract all links to files or images contained inside some HTML text. I've tried several examples, but they failed for many reasons (the main one being that I'm not skilled at regular expressions :) )

1) First I've tried this:

> Regex("<img[^>]+src=[""']([^""']+)[""']", RegexOptions.Singleline Or
> RegexOptions.IgnoreCase)

(It works OK for images)

2) And then this:

Regex("href=[""']([^""']+)[""']", RegexOptions.Singleline Or RegexOptions.IgnoreCase)

1) extracts all images; it works OK, but that's only a partial solution. 2) extracts all href="asdf" values, but I want to extract only the hrefs pointing to files; I don't want anchors (#middlesection), .aspx pages, or even URLs without extensions like href="www.google.com/site".

I want to know how I can extract all files from a given text, a "file" being any link that ends with a dot and three characters :)

I'm not interested in ".aspx" or ".html", nor in extensionless URLs like "id_content=99", nor in anchors like "#anchor123".

Is it possible to pack this into one single regex? The idea behind all this is that I have to copy every single file referenced in some HTML from one place to another, so I need an ArrayList containing only the file paths to copy.
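
In case it helps, this is roughly the surrounding code right now (a simplified sketch; CollectHrefs and htmlText are just placeholder names, and it uses the second regex from above):

    Imports System.Collections
    Imports System.Text.RegularExpressions

    Module LinkDump
        ' Roughly what I have today: collect every href into an ArrayList.
        ' Filtering the result down to "real files" is the part I'm missing.
        Function CollectHrefs(ByVal htmlText As String) As ArrayList
            Dim files As New ArrayList()
            Dim hrefRx As New Regex("href=[""']([^""']+)[""']", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
            For Each m As Match In hrefRx.Matches(htmlText)
                files.Add(m.Groups(1).Value) ' currently also picks up anchors and .aspx links
            Next
            Return files
        End Function
    End Module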

Thanks in advance!

Added some sample code just to clarify that this is not about "in the wild" HTML.

Given this code:

<p>This is a paragraph</p>
<br>
<a href="#someplace">Go to someplace</a>
<ul>
    <li><p><a href="../files/document.pdf">Important PDF 1</a></p></li>
    <li><p><a href="../files/document.xls">Important XLS</a></p></li>
</ul>
<a href="content.aspx?id_content=55">Go to content 55</a>
<br>
<img src="../images/nicelogo.jpg">

I want to get this:

"../files/document.pdf"
"../files/document.xls"
"../images/nicelogo.jpg"

I DON'T want to get this:

"#someplace"
"content.aspx?id_content=55"

That's it: with the regexes I have, I get all the links, but I ONLY want the ones that represent a file. The HTML is written by hand by me (long story), so there will be no strange double-double quotes, malformed tags or strange chars.

I know it's possible to do because it's almost done; I just don't know how to say "give me only the matches that have ".something" at the end, where "something" is a three-character string". Am I clear? :)

Remoto
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – MK. Oct 02 '12 at 22:20
  • I understand that RegExp is not the perfect solution, but in this case it's not about HTML "in the wild". I write the HTML myself and I know that there will be src="../files/image.jpg" or href="../files/document.pdf", and that's the kind of links I want to extract. In plain English the expression would be: give me those links after src= or href= that end with a dot and three letters (my definition of a file), ignoring everything else :) I could accomplish part of this; I'm missing the "ends in . and three letters" part due to lack of RegExp knowledge – Remoto Oct 03 '12 at 00:55
  • DID YOU EVEN READ THE LINKED ANSWER????? – MK. Oct 03 '12 at 01:23
  • Yes, I've read everything from the Fermat reference to the HTML Agility Pack (which I refuse to link into my project since I know that a proper regex will do the job). And did you read the part where I say that this is not "in the wild" HTML but code written by myself, with NO strange chars and NO funny symbols :) Again, I can capture text between href= or src= quotes; all I want is to keep only the ones that end with a dot and three letters, which is (in my own-controlled-html-not-in-the-wild-world) a file. – Remoto Oct 03 '12 at 02:46
  • See, the problem here is that you are contradicting yourself. If doing what you want is easy with regex, then why are you asking for help? – MK. Oct 03 '12 at 02:59
  • Seriously -- why would you want to use regex over a good library? Your regex will break if you fart next to it and the library will keep working when your html changes significantly in a couple of months. – MK. Oct 03 '12 at 03:07
  • I'm asking for help because I don't know enough about RegExp to achieve the result I want by myself, otherwise I would not be posting a question here. On the other hand, it looks pretty "easy" to me to see that this is the main purpose of this place :) – Remoto Oct 03 '12 at 03:10
  • @MK - Relax. This is a programming forum, not world peace. – Howiecamp Jan 04 '14 at 18:47

3 Answers


Based on your examples, the bulk of the expression should not match a question mark, fragment hash or double quote:

"([^?#"]*)\.[a-z]{3,4}"

The last part forces an extension of 3 to 4 characters preceded by a period.

Edit

To capture the part in between the double quotes:

"(([^?#"]*)\.[a-z]{3,4})"

Not sure how to avoid memory captures on the base name with ASP; in PCRE you would use ?:
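
An untested VB.NET sketch of how you might plug the second pattern in (this assumes Imports System.Text.RegularExpressions and an htmlText string holding the markup; note the doubled quotes inside the VB string literal):

    ' group 1 is the full path, e.g. ../files/document.pdf
    Dim rx As New Regex("""(([^?#""]*)\.[a-z]{3,4})""", RegexOptions.IgnoreCase)
    For Each m As Match In rx.Matches(htmlText)
        Console.WriteLine(m.Groups(1).Value)
    Next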

Ja͢ck
  • Jack, I've used the optional 3,4 chars in your answer, so you helped me find the right answer. I need to use a-z0-9 because there are files named like "document20121002.pdf", and avoiding ? and # is not needed because the URLs are of three kinds: a link to some place, an anchor, or a file with an extension. Thanks for your help! – Remoto Oct 03 '12 at 13:47
  • @remoto if this answer was useful, consider upvoting it; you have enough rep for it, I think – Ja͢ck Oct 03 '12 at 13:56

You really don't want to try parsing URLs out by yourself. There are all kinds of formats in which resources might be referenced. You could have src=foo with no quotes, src='foo', or src="foo"; you could have included stylesheets which themselves reference other resources; you need to do entity decoding (src='f&quot;oo') and URL encoding (src='f o o'), and deal with relative vs absolute URLs (did you know that src='//somesite.com/blah' is different than src='http://somesite.com/blah' and src='somesite.com/blah'?), and so on. And there are the issues that you mentioned, and probably more that I haven't thought of. There are already numerous questions on StackOverflow about why it's a bad idea to try parsing HTML with a RegEx, with answers ranging from the serious to the humorous.

Instead, why not use an existing tool which already solves the problem, like wget? See wget's recursive download support to follow links and crawl a site for referenced resources.

Brian Campbell
  • Thanks for your suggestion Brian, but I have full control over the HTML being "parsed", that's because I write it :) so I know for sure that there will be no strange chars or '\\'. I have to put this "magical" regexp in a VB forms app that I wrote to update website content. So, simple as it seems, I can't make the right regexp to get all files referenced by any src and href in a given HTML text :) – Remoto Oct 03 '12 at 00:41

Something like this should work:

<a href=\"(.*\.[a-z0-9]{3})\"

but if it does, you have to promise me that you will come back and comment here when you regret that you used regex for this.

MK.
  • Promised! :) Great, it's almost there! But it doesn't work with href="http://www.domain.com/1.htm", so my definition of file must be changed to "ends with a dot and 3 or 4 chars (forgot .xlsx), except for .html .aspx .php" – Remoto Oct 03 '12 at 04:04
  • Promised and delivered, actually. There is no sane way to do that, ok? Just capture the extension and test it in code. – MK. Oct 03 '12 at 04:12
  • Using your answer and Jack's I finally arrived at the answer, which is: href=\"(.*\.(?:(?:[a-z0-9]{3,4})(?<!htm|html|asp|aspx|php)))\"|src=\"(.*\.(?:(?:[a-z0-9]{3,4})(?<!htm|html|asp|aspx|php)))\" This captures all hrefs and srcs pointing to a file (3 or 4 char extension, avoiding web pages) and at the same time ignores anything that doesn't end with an extension (including extensionless URLs and also # and ?). THANKS, the program is working now and I no longer have to manually go through the code to collect all the files referenced :) (a wiring sketch follows these comments) – Remoto Oct 03 '12 at 13:42
  • @Remoto make sure to come back when it fails and apologize. – MK. Oct 03 '12 at 14:15
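
For reference, a minimal sketch wiring up the final pattern from the comment above (untested; ExtractFilePaths and htmlText are placeholder names, and the actual copy step is left out):

    Imports System.Collections
    Imports System.Text.RegularExpressions

    Module FileLinkExtractor
        ' Final pattern from the comments, written as a VB string (quotes doubled).
        ' Group 1 holds href matches, group 2 holds src matches.
        Private ReadOnly FileLinkRx As New Regex( _
            "href=""(.*\.(?:(?:[a-z0-9]{3,4})(?<!htm|html|asp|aspx|php)))""|" & _
            "src=""(.*\.(?:(?:[a-z0-9]{3,4})(?<!htm|html|asp|aspx|php)))""", _
            RegexOptions.IgnoreCase)

        Function ExtractFilePaths(ByVal htmlText As String) As ArrayList
            Dim paths As New ArrayList()
            For Each m As Match In FileLinkRx.Matches(htmlText)
                ' Only one alternative matches at a time, so take whichever group succeeded.
                If m.Groups(1).Success Then
                    paths.Add(m.Groups(1).Value)
                Else
                    paths.Add(m.Groups(2).Value)
                End If
            Next
            Return paths ' e.g. ../files/document.pdf, ../files/document.xls, ../images/nicelogo.jpg
        End Function
    End Module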