0

Say I have this example string:

    <td><a href="/one-two-three/menus" title="test"</td>
    <td><a href="/one-two-three/menus/13:00 title="test"</td>
    <td><a href="/one-two-three/schedule/could be multiple delimiters/14:00 title="test"</td>

I want to use a regex that returns 2 results, matching only when the full path starts with /one-two-three and ends with hh:mm. E.g. I want to get:

/one-two-three/menus/13:00
/one-two-three/schedule/could be multiple delimiters/14:00

I've tried regex pattern /one-two-three[\s\S]+?[0-9][0-9]:[0-9][0-9]

but this gives

Found 2 matches:
1. /one-two-three/menus" title="test"</td>     <td><a href="/one-two-three/menus/13:00
2. /one-two-three/schedule/could be multiple delimiters/14:00

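I can see why I get these results: the first match starts at the first /one-two-three and runs across the tag boundary until it reaches 13:00. My question is: what pattern can I use to exclude the paths without hh:mm, given that there can be any number of delimiters between /one-two-three and hh:mm?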

2 Answers

2

If the HTML structure is important to you, regex is the wrong approach.

Otherwise (if you can match the string anywhere as long as it's surrounded by "), you might want to try this:

/one-two-three[^"]+?[0-9][0-9]:[0-9][0-9]

[\s\S] basically means any character, including the " that ends one path and everything up to the next path. But you only want characters that are not ", because a " marks the end of the path.
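As a quick check, here is a minimal Python sketch that runs the pattern above against the sample from the question. Python's `re` module and the variable names are assumptions made for illustration; the question does not say which regex engine is in use.

```python
import re

# Sample input copied from the question (the unbalanced quotes are kept as-is).
html = '''<td><a href="/one-two-three/menus" title="test"</td>
<td><a href="/one-two-three/menus/13:00 title="test"</td>
<td><a href="/one-two-three/schedule/could be multiple delimiters/14:00 title="test"</td>'''

# [^"] cannot consume a double quote, so a match that starts in the first
# href can never run on into the second one.
pattern = re.compile(r'/one-two-three[^"]+?[0-9][0-9]:[0-9][0-9]')

matches = pattern.findall(html)
for path in matches:
    print(path)
```

The first href has no hh:mm before its closing ", so the engine gives up on it and moves on to the next occurrence of /one-two-three.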

Martin Ender
  • Thanks for that. HTML is not important in my case, although I have seen that link before. Most of the parsers that get suggested (e.g. HTML Agility Pack) are fine when dealing with a few pages but far too slow when hundreds or thousands of responses have to be processed. :) –  Sep 27 '12 at 12:37
  • Well yeah, the point of that page is not that it's "not elegant" to use regex to parse HTML. It's literally impossible, because HTML is not a regular language. Regex can only solve an HTML problem if the problem depends very little on the HTML structure. – Martin Ender Sep 27 '12 at 12:39
0

Try searching for

".*\"/{one-two-three}{.*}{[0-9][0-9]:[0-9][0-9]}.*"

The braces tag three groups that you can refer to in the replacement:

\1 = one-two-three, \2 = middle parts, \3 = hh:mm

If you replace with \1\3, the middle portion is eliminated.
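For comparison, a rough Python sketch of the same idea, assuming the braces behave like ordinary capture groups (written as parentheses in Python's `re`). The variable names are made up for illustration, and the pattern is applied to a single line from the question.

```python
import re

# One line of the sample from the question.
line = '<td><a href="/one-two-three/menus/13:00 title="test"</td>'

# Parentheses stand in for the braces of the search pattern above.
grouped = re.compile(r'.*"/(one-two-three)(.*)([0-9][0-9]:[0-9][0-9]).*')

m = grouped.match(line)
prefix, middle, time = m.groups()

# Gluing all three groups back together gives the full path.
rebuilt = "/" + prefix + middle + time

# Replacing with \1\3 drops the middle portion, as described above.
shortened = grouped.sub(r"\1\3", line)

print(rebuilt)
print(shortened)
```

Note that a line without hh:mm does not match at all, so a replace would leave it unchanged rather than remove it.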

Hope this helps :)

Icarus