I have a proprietary content scheme from which I need to scrape ranges of custom HTML-like tags.

Two examples of these tags are:

<college-point image>/48e1255c8bd8d1c8a6c5d263f7130853.jpg</college-point>
<college-point podcast-episode>704097</college-point>

I had an expression (<college-point\s\w*>([^>]+)>) which worked well for finding tags with one-word tag values, like image. When I added podcast-episode, I could not get the expression to accept the hyphen.

I tried something like <college-point[\s\w*]([^>]+)>, but this only matches the opening tag, not the entire element. What syntax should I use to allow hyphenated tag values?

Ricky
  • 3
    Obligatory [link](https://stackoverflow.com/a/1732454/62576) about the futility of trying to parse [X]HTML with regular expressions instead of using a proper DOM parser. – Ken White Jul 27 '20 at 03:26
  • 4
    `<college-point[^>]*>[^>]+>` should work, but you should avoid parsing HTML using regex. – anubhava Jul 27 '20 at 03:55
  • 1
    Ricky, you probably have to stress the difference between your HTML-like input and actual HTML. Explain why you cannot use existing tools. Otherwise you will get more comments not to use regex on it and more links to that (funny) answer. – Yunnosch Jul 27 '20 at 05:06
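
One way the hyphen could be allowed, as a minimal sketch only: the question does not name a regex flavor, so Python's `re` module is an assumption here, and `TAG_RE`, `sample`, and `find_tags` are hypothetical names chosen for illustration. The attempt `[\s\w*]` is a character class, so it matches exactly one character (a whitespace character, a word character, or a literal `*`), which is why the rest of that pattern stops at the end of the opening tag. Putting the hyphen inside a class that is then repeated, as in `[\w-]+`, accepts both `image` and `podcast-episode`.

```python
import re

# Hypothetical pattern, not taken from the question:
#   \s+        whitespace after the tag name
#   ([\w-]+)   tag value made of word characters and hyphens
#   ([^<]+)    tag content, up to the start of the closing tag
TAG_RE = re.compile(r"<college-point\s+([\w-]+)>([^<]+)</college-point>")

# The two example tags from the question.
sample = (
    "<college-point image>/48e1255c8bd8d1c8a6c5d263f7130853.jpg</college-point>\n"
    "<college-point podcast-episode>704097</college-point>\n"
)

def find_tags(text):
    """Return a list of (tag value, content) pairs for each tag found."""
    return TAG_RE.findall(text)

for value, content in find_tags(sample):
    print(value, content)
```

This sketch assumes the content never contains a `<` character and that tags are not nested; if either can happen in the real content scheme, a pattern like this will not be enough.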

0 Answers