I am already parsing pages with HtmlAgilityPack and getting most img sources. However, many websites include image URLs in places other than the img src attribute (e.g. inlined JavaScript, a different attribute, a different element). I would like to cast a slightly wider net and run a regex over the entire HTML string to capture URLs that meet the following rules:
- Must begin with http://, https://, //, or /
- Then, any number of valid url path characters
- Must end with .jpeg, .jpg, .png, or .gif
I imagine this would be simple to write, but I am not an awesome regexer. I imagine the parts would look like this:
- ^((https?\:\/\/)|(\/{1,2}))
- (any ideas?)
- (\.(jpe?g|png|gif))$
Can anyone help me fill in the blanks?
Thanks
Answer
(https?:)?//?[^\'"<>]+?\.(jpg|jpeg|gif|png)
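To see how this pattern behaves on a whole HTML string, here is a minimal sketch in Python rather than .NET, purely for illustration; the pattern itself should carry over to .NET's Regex class unchanged. The function name `find_image_urls`, the sample markup, and the example.com URLs are made up for the demo. The groups are written as non-capturing and the match is case-insensitive, which are small assumptions on top of the pattern above.

```python
import re

# The pattern from the answer, with non-capturing groups so each match is the
# whole URL. IGNORECASE is an added assumption so that ".JPG" also matches.
IMG_URL = re.compile(
    r"""(?:https?:)?//?[^'"<>]+?\.(?:jpg|jpeg|gif|png)""",
    re.IGNORECASE,
)


def find_image_urls(html):
    """Return every substring of html that matches IMG_URL, in document order."""
    return [m.group(0) for m in IMG_URL.finditer(html)]


# Hypothetical markup with image URLs in three different places.
sample = (
    '<img src="http://example.com/a.jpg">'
    '<div data-bg="//cdn.example.com/b.png"></div>'
    "<script>var hero = '/static/c.gif';</script>"
)

print(find_image_urls(sample))
# ['http://example.com/a.jpg', '//cdn.example.com/b.png', '/static/c.gif']
```

Note that there are no ^ or $ anchors, since the URLs sit in the middle of the document. Because the character class only excludes quotes and angle brackets, a match can run across whitespace (for example from a `//` JavaScript comment to a later file name); adding `\s` to the excluded set is one possible tightening if that turns out to be a problem.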