
I'm trying to write a regular expression that extracts only the classes from an HTML tag.

E.g.

<h1 class="big blue" id="testing"> some text </h1>

I want the regular expression to return big blue. I've been trying to do that but it includes the id as well:

Regular expression: <(.+)?class=\s*"(.+)?">

Testing example: <h1 class="big blue" id="testing"> some text </h1>

https://regex101.com/r/0weyDs/2

samjo
  • Please don't use regex for html: [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – sshashank124 Jan 12 '20 at 11:54
  • @sshashank124 the reason I'm doing it is that the HTML is stored in a db as text. So when I get it from the db, I want to run some regex on it. – samjo Jan 12 '20 at 11:56
  • Use an html parser – sshashank124 Jan 12 '20 at 11:57
  • @sshashank124 thanks for your suggestion. But, how can I make changes to that HTML if it gets parsed? Can you please refer me to an article or something that you might have read about it? Thanks again – samjo Jan 12 '20 at 12:03
  • I have no idea which language you are using since you haven't included a tag for that. But for python, there is the [`beautifulsoup4`](https://pypi.org/project/beautifulsoup4/) package – sshashank124 Jan 12 '20 at 12:04
  • Have a look at https://stackoverflow.com/q/3577641/372239 – Toto Jan 12 '20 at 12:34

1 Answer


(I am using JavaScript to do it)

If you are certain there is no " inside the class name of class="abc xyz", then you can use

/<(.+?)class=\s*"([^"]*?)"/g

Example:

([...'<h1 class="big blue" id="testing"> some text </h1><div id="foo" class="blue danube page-title"> some text </div><span class=""></span>'
  .matchAll(/<(.+?)class=\s*"([^"]*?)"/g)].map(arr => arr[2]))

would give

["big blue", "blue danube page-title", ""]

One bug concerns non-greedy matching: the non-greedy form is .+?, whereas (.+)? means "match as much as possible" and then makes the whole group optional.
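To make the difference concrete, here is a minimal sketch that runs both forms against the tag from the question. The variable names (`questionTag`, `greedyOptional`, `lazy`) are only illustrative, and the patterns are trimmed to the class part for brevity.

```javascript
// The sample tag from the question.
const questionTag = '<h1 class="big blue" id="testing"> some text </h1>';

// (.+)? is a greedy group that is then made optional:
// it captures as much as possible up to the LAST quote on the line.
const greedyOptional = questionTag.match(/class=\s*"(.+)?"/);

// (.+?) is the lazy form: it stops at the FIRST closing quote.
const lazy = questionTag.match(/class=\s*"(.+?)"/);

console.log(greedyOptional[1]); // big blue" id="testing
console.log(lazy[1]);           // big blue
```

The only change between the two patterns is where the ? sits relative to the closing parenthesis.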

The other concern is that you probably want class="" to match as "", so it should be [^"]* rather than [^"]+.
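A small sketch of that difference, assuming a made-up input string (`emptyClassHtml` is not from the question) that contains one empty and one non-empty class attribute:

```javascript
// Hypothetical input: one empty class attribute, one non-empty.
const emptyClassHtml = '<span class=""></span><p class="note">x</p>';

// [^"]* allows zero characters, so class="" is reported as "".
const withStar = [...emptyClassHtml.matchAll(/class=\s*"([^"]*)"/g)].map(m => m[1]);

// [^"]+ needs at least one character, so class="" is skipped entirely.
const withPlus = [...emptyClassHtml.matchAll(/class=\s*"([^"]+)"/g)].map(m => m[1]);

console.log(withStar); // [ '', 'note' ]
console.log(withPlus); // [ 'note' ]
```

Which behaviour is right depends on whether an empty class attribute should count as a result or be ignored.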

One issue with your original regex is that it also matches the closing >. Because "> must directly follow the captured text, the match has to run up to the last quote before the >, even if you use a non-greedy quantifier. You can see https://regex101.com/r/0weyDs/3 for

<(.+?)class=\s*"(.+?)"

or https://regex101.com/r/0weyDs/4 for the first regex in the answer.
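The effect of the trailing > can be checked locally with a short sketch like the following; the variable names are illustrative, and both patterns use the lazy .+? so that the > is the only difference.

```javascript
// The sample tag from the question.
const heading = '<h1 class="big blue" id="testing"> some text </h1>';

// With "> required, even a lazy group must extend to the quote
// that sits right before >, which drags the id attribute in.
const withBracket = heading.match(/<(.+?)class=\s*"(.+?)">/);

// Without the >, the lazy group stops at the first closing quote.
const withoutBracket = heading.match(/<(.+?)class=\s*"(.+?)"/);

console.log(withBracket[2]);    // big blue" id="testing
console.log(withoutBracket[2]); // big blue
```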

nonopolarity