I'm trying to extract certain URLs from HTML (for example, all that begin with http, contain /tempfiles/ and end in .jpg). I have something like:

http.*?\/tempfiles\/.*?\.jpg

The problem is when I have HTML like:

blah blah <img src=http://somelink/file.html>http://server/tempfiles/blah.jpg
blah blah

It returns everything from the first http onward, i.e. http://somelink/file.html, the junk in between, and http://server/tempfiles/blah.jpg.

Is there a way to say there must not be a second http between the first and the /tempfiles/?
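To make the failure concrete, here is a minimal sketch that runs the pattern above against the sample HTML. Python is an assumption (the question does not show the host language), and the variable names are illustrative only.

```python
import re

# Sample HTML from the question (two lines).
html = (
    "blah blah <img src=http://somelink/file.html>"
    "http://server/tempfiles/blah.jpg\n"
    "blah blah"
)

# The original pattern: lazy, but nothing stops it crossing a second "http".
naive = re.compile(r"http.*?/tempfiles/.*?\.jpg")

naive_matches = naive.findall(html)
print(naive_matches)
# The match starts at the FIRST "http", so the unwanted URL is included.
```

Laziness only makes the match end as early as possible; it does not move the starting point, which is why the first URL is swallowed.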

Lee Tickett
  • forbid whitespace: https://stackoverflow.com/questions/1181271/regex-to-match-a-single-character-that-is-anything-but-a-space – trollingchar Apr 04 '19 at 20:03
  • Try `http:\/\/[^\/]+\/tempfiles\/(?:[^\/]+\/)*\w+\.jpg` [demo](https://regex101.com/r/IIxqYc/2) – The fourth bird Apr 04 '19 at 20:09
  • @trollingchar I have tweaked the question a little to show this won't always work. I have found a workaround (I'm actually forbidding quotes) but still want to know the answer, as I'm sure there must be something to say "match anything except this"? – Lee Tickett Apr 04 '19 at 20:10
  • @Thefourthbird can you elaborate a tad on what/how yours works? (I haven't tested it, but I can't see anywhere it looks like you're saying don't match http.) – Lee Tickett Apr 04 '19 at 20:12
  • @LeeTickett I have updated my comment. Can you try matching the format of the url `http://[^/]+\/tempfiles/(?:[^/]+\/)*\w+\.jpg` [demo](https://regex101.com/r/cxGv7R/1) – The fourth bird Apr 04 '19 at 20:13
  • @Thefourthbird I follow that one slightly better, but it's still not flexible enough (it broke when I tried changing the input to `http://server/diff-folder/tempfiles/blah.jpg`). – Lee Tickett Apr 04 '19 at 20:17
  • Then you could use `http://[^/]+(?:/[^/]+)*/tempfiles/(?:[^/]+\/)*\w+\.jpg` [demo](https://regex101.com/r/cxGv7R/2) – The fourth bird Apr 04 '19 at 20:20
  • Or simply `http://[^:]+/tempfiles/(?:[^/]+\/)*\w+\.jpg` perhaps? – iakobski Apr 04 '19 at 20:22
  • Your next problem will be with something like `http://server/tempfiles/blah.txt>http://server/image/blah.jpg`. You're so much better off if you just parse the HTML with a proper parser and then run the regex on the text you extract. – juharr Apr 04 '19 at 21:19
  • @juharr exactly why I was looking for the "not containing http" command... which `Wiktor Stribiżew` has provided in his answer below. – Lee Tickett Apr 04 '19 at 21:21
  • @LeeTickett That will match the whole thing I just put in because there isn't an `http` between the first `http` and the `/tempfiles/`. Then if you make sure there isn't a `http` between the `/tempfiles/` and the `jpg` that would still match this `http://server/tempfiles/blah.txt>blah.jpg`. What you really need is to delimit where things end with whitespace and `>` and `<` which would quickly become a regex that snowballs out of control. – juharr Apr 04 '19 at 21:23
  • @juharr good spot- but I have what I need now to extrapolate so to speak :) `http(?:(?!http).)*?/tempfiles/(?:(?!http).)*?\.jpg` I think would do the trick? – Lee Tickett Apr 04 '19 at 21:27
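The two inputs juharr raises can be checked directly. The sketch below is in Python (an assumption, since the thread shows no host code). It compares the pattern with one tempered token against the one with two, then adds a hypothetical `delimited` pattern that approximates juharr's "delimit with whitespace, `>` and `<`" idea using a negated character class. That last pattern is not from the thread and is only one possible reading of the suggestion.

```python
import re

case_a = "http://server/tempfiles/blah.txt>http://server/image/blah.jpg"
case_b = "http://server/tempfiles/blah.txt>blah.jpg"

# One tempered token (before /tempfiles/ only).
tempered_once = re.compile(r"http(?:(?!http).)*?/tempfiles/.*?\.jpg")
# Two tempered tokens, as proposed in the last comment.
tempered_twice = re.compile(r"http(?:(?!http).)*?/tempfiles/(?:(?!http).)*?\.jpg")
# Hypothetical variant: a URL may not contain whitespace, <, >, or quotes.
delimited = re.compile(r"""http[^\s<>"']*?/tempfiles/[^\s<>"']*?\.jpg""")

once_a = tempered_once.findall(case_a)    # still matches the whole string
twice_a = tempered_twice.findall(case_a)  # second "http" now blocks the match
twice_b = tempered_twice.findall(case_b)  # no second "http", so it over-matches
delim_b = delimited.findall(case_b)       # ">" ends the URL, so no match

print(once_a, twice_a, twice_b, delim_b)
```

The second tempered token fixes the first input but not the second, which is the over-matching juharr describes; a delimiter-based class handles both samples, though real HTML is still more reliably handled by a parser.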

1 Answer

You may use

http(?:(?!http).)*?/tempfiles/.*?\.jpg

See the regex demo and a Regulex graph:

[Image: Regulex graph of the pattern]

Details

  • http - an http substring
  • (?:(?!http).)*? - any char other than a newline char that does not start an http char sequence, 0 or more repetitions, as few as possible (a "tempered greedy token")
  • /tempfiles/ - a literal substring
  • .*? - any 0+ chars other than newline, as few as possible
  • \.jpg - a .jpg substring.
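As a quick check of the breakdown above, here is a minimal Python sketch (Python is an assumption; the pattern itself is unchanged) that applies the regex to the sample HTML from the question. The variable names are illustrative.

```python
import re

html = (
    "blah blah <img src=http://somelink/file.html>"
    "http://server/tempfiles/blah.jpg\n"
    "blah blah"
)

# (?:(?!http).)*? consumes one character at a time, and before each
# character the lookahead checks that "http" does not start there.
tempered = re.compile(r"http(?:(?!http).)*?/tempfiles/.*?\.jpg")

tempered_matches = tempered.findall(html)
print(tempered_matches)
```

Starting at the first `http`, the token cannot step over the second `http`, so that attempt fails and the engine retries from the second URL, which is the one that matches.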
Wiktor Stribiżew
  • Legend. Easy when you know how!? I removed the ?: to end up with `http((?!http).)*?/tempfiles/.*?\.jpg` any reason why you had them in there? – Lee Tickett Apr 04 '19 at 21:13
  • @LeeTickett Do not use a capturing group here, it will hamper matching. Use a non-capturing group. Or, compile the regex with the `RegexOptions.ExplicitCapture` option; all unnamed capturing groups will then behave as non-capturing. Alternatively, you may just prepend the pattern with `(?n)`. – Wiktor Stribiżew Apr 04 '19 at 21:16
  • Just did a quick search. Does using non capturing groups just optimise/speed things up? – Lee Tickett Apr 04 '19 at 21:19
  • 1
    @LeeTickett Yes, and in this case of a tempered greedy token, to a greater extent. – Wiktor Stribiżew Apr 04 '19 at 21:20
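The sketch below does not measure speed. It only illustrates, in Python (an assumption; the comments above refer to .NET, and Python's `re` has no equivalent of `(?n)` or `RegexOptions.ExplicitCapture`), that the capturing and non-capturing forms match the same text while differing in what they record. In Python this difference is visible through `findall`, which returns the group's content instead of the whole match when the pattern contains exactly one capturing group.

```python
import re

html = (
    "blah blah <img src=http://somelink/file.html>"
    "http://server/tempfiles/blah.jpg\n"
    "blah blah"
)

capturing = re.compile(r"http((?!http).)*?/tempfiles/.*?\.jpg")
non_capturing = re.compile(r"http(?:(?!http).)*?/tempfiles/.*?\.jpg")

# The overall match is identical for both forms.
whole_capturing = capturing.search(html).group(0)
whole_non_capturing = non_capturing.search(html).group(0)

# findall differs: with one capturing group it returns that group's
# last captured character rather than the URL.
found_capturing = capturing.findall(html)
found_non_capturing = non_capturing.findall(html)

print(whole_capturing, whole_non_capturing)
print(found_capturing, found_non_capturing)
```

With the capturing form the engine also has to store a capture on every iteration of the repeated group, which is the extra work the non-capturing form avoids.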