Regular expression to match ID's and classes in CSS page

Question

I'm trying to analyze HTML code and extract all CSS classes and IDs from the source. I need to extract whatever is between the two quotation marks when they are preceded by either class or id:

id="<extract this>"

class="<extract this>"

This is the compulsory comment reminding you that you should be using an XML/HTML parser, not regex, for HTML. – Etheryte, Apr 24 '14 at 18:07
Whatever programming language you are using, be sure to use a parser and not regex. – hwnd, Apr 24 '14 at 18:07
Thank you for your suggestions, but if I wanted to use an HTML parser, I would have posted that instead. I simply need to extract any classes and IDs from a page, that's all. I'm organizing stylesheets, so I want a list of the classes and IDs used in the plain HTML source before it gets compiled and jQuery Mobile blows it up with its own custom classes. – eveo, Apr 24 '14 at 18:10
Might be related to: http://stackoverflow.com/a/1732454/464257 – Shaz, Apr 24 '14 at 18:14
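
For comparison, the parser route that the comments recommend could look roughly like the sketch below. It is only an illustration, not something posted in the thread: it assumes Python and its standard-library html.parser module, and the names AttrCollector and collect_ids_and_classes are made up for the example.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects id values and individual class names from start tags."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so CLASS="x" arrives as "class".
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <input disabled>
                continue
            if name == "id":
                self.ids.add(value)
            elif name == "class":
                # A class attribute may list several names separated by whitespace.
                self.classes.update(value.split())


def collect_ids_and_classes(html):
    """Hypothetical helper: returns (sorted ids, sorted class names)."""
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.ids), sorted(collector.classes)


# The text inside <p> looks like an attribute but is not one, so it is ignored.
sample = '<div id="main" class="box wide"><p CLASS="note">class="fake"</p></div>'
ids, classes = collect_ids_and_classes(sample)
print(ids)
print(classes)
```

Unlike a regex, this only looks at real attributes on tags, so attribute-like text in the page body or in scripts is not picked up.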

score 2 · Answer 1 · edited Apr 24 '14 at 18:12

/(?:id|class)="([^"]*)"/gi

replacement expression: $1

This regex in English: match either "id" or "class", then an equals sign and a quote, then capture everything that is not a quote up to the closing quote. Do this globally and case-insensitively. The full match includes the attribute name; the value on its own is in capture group 1 ($1).
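
A small sketch of the difference between the full match and the captured group, assuming Python's re module (the answer itself uses JavaScript-style /gi flags; here findall/finditer stand in for "global" and re.IGNORECASE for "i"). The sample HTML string is invented for the example.

```python
import re

# Same pattern as above, with the /i flag expressed as re.IGNORECASE.
PATTERN = re.compile(r'(?:id|class)="([^"]*)"', re.IGNORECASE)

html = '<div id="header" CLASS="nav top"><span class="">x</span></div>'

# group(0) is the whole match, attribute name included.
full_matches = [m.group(0) for m in PATTERN.finditer(html)]

# With exactly one capture group, findall() returns just that group.
values = PATTERN.findall(html)

print(full_matches)
print(values)
```

A tester that highlights the whole match will show id="header" in full; the bare value only appears once you read group 1.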

edited Apr 24 '14 at 18:12

Tim Pietzcker

answered Apr 24 '14 at 18:11

Pat Newell

nice @Tim! those regexes... they'll get you every time. – Pat Newell Apr 24 '14 at 18:15
I inputted this on http://www.regexr.com/, along with an HTML page at the bottom and it matches the entire "id='id'" instead of just id. Can you verify? http://cl.ly/image/18363j1w1g1V – eveo Apr 24 '14 at 18:21

hwnd · Answer 2 · 2014-04-24T18:21:06.357

Since you prefer using a regular expression, here is one way to do it.

\b(?:id|class)\s*=\s*"([^"]*)"

Regular expression:

\b             # the boundary between a word char (\w) and not a word char
(?:            # group, but do not capture:
  id           # 'id'
 |             # OR
  class        # 'class'
)              # end of grouping
\s*            # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
=              # '='
\s*            # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
"              # '"'
(              # group and capture to \1:
  [^"]*        #   any character except: '"' (0 or more times)
)              # end of \1
"              # '"'

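To see what the \b and \s* parts of this pattern change in practice, here is a hedged sketch in Python (an assumption; the answer does not name a language). STRICT is this answer's pattern written in verbose mode, LOOSE is the simpler pattern without \b and \s*, and the sample tag is invented.

```python
import re

# This answer's pattern, laid out with re.VERBOSE so it can carry comments.
STRICT = re.compile(r'''
    \b              # word boundary: "id" must not be the tail of a longer word
    (?:id|class)    # attribute name
    \s*=\s*         # equals sign with optional whitespace on either side
    "([^"]*)"       # capture the double-quoted value
''', re.VERBOSE)

# The simpler pattern: no word boundary, no whitespace tolerance.
LOOSE = re.compile(r'(?:id|class)="([^"]*)"')

html = '<div class = "card" id="a1" grid="3x3" data-id="77">'

strict_values = STRICT.findall(html)
loose_values = LOOSE.findall(html)

print(strict_values)
print(loose_values)
```

The strict version accepts class = "card" and skips grid="3x3", while the loose one does the opposite. Both still pick up data-id="77", because a hyphen is not a word character and \b therefore matches right after it.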
score 1 · Accepted Answer · answered Apr 24 '14 at 18:33

You may want to try this:

<?php

$css = <<< EOF
id="<extract this>"
class="<extract this>"id="<extract this2>"
class="<extract this3>"id="<extract this4>"
class="<extract this5>"id="<extract this6>"
class="<extract this7>"id="<extract this8>"
class="<extract this9>"
EOF;

// Capture group 1 holds the attribute value. With PREG_PATTERN_ORDER,
// $classes[0] lists the full matches and $classes[1] lists the captured values.
preg_match_all('/(?:id|class)="(.*?)"/sim', $css, $classes, PREG_PATTERN_ORDER);
foreach ($classes[1] as $value) {
    echo $value . "\n";
}
/*
Output:
<extract this>
<extract this>
<extract this2>
<extract this3>
<extract this4>
<extract this5>
<extract this6>
<extract this7>
<extract this8>
<extract this9>
*/
?>

DEMO:
http://ideone.com/Nr9FPt
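
If the goal is a clean inventory for organizing stylesheets, the raw matches probably need one more step: keeping IDs and classes apart, splitting multi-class values such as class="a b", and dropping duplicates. The following is a rough sketch of that step in Python (an assumption; the answer above is PHP). The function name list_ids_and_classes and the sample page are hypothetical.

```python
import re

# Capture the attribute name as well, so ids and classes can be told apart.
ATTR = re.compile(r'\b(id|class)\s*=\s*"([^"]*)"', re.IGNORECASE)


def list_ids_and_classes(html):
    """Hypothetical helper: returns (sorted unique ids, sorted unique class names)."""
    ids, classes = set(), set()
    for name, value in ATTR.findall(html):
        if name.lower() == "id":
            if value.strip():
                ids.add(value.strip())
        else:
            # class="btn main" contributes two separate class names.
            classes.update(value.split())
    return sorted(ids), sorted(classes)


page = '''
<div id="page" class="ui-content main">
  <a class="btn main" id="save">Save</a>
</div>
'''
ids, classes = list_ids_and_classes(page)
print(ids)
print(classes)
```

Like the regexes above, this only handles double-quoted values and will also match attribute-like text outside of tags.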

answered Apr 24 '14 at 18:33

Pedro Lobito

Exactly what I wanted. I just threw my giant HTML page into the CSS variable, ran it, and it neatly printed every ID and class on that HTML page. Thank you! – eveo Apr 24 '14 at 18:37
Tuga, what does the /sim mean? – eveo Apr 25 '14 at 14:53
`s` modifier: single line; the dot also matches newline characters. `i` modifier: case-insensitive match (ignores the case of [a-zA-Z]). `m` modifier: multi-line; ^ and $ match at the beginning/end of each line (not only the beginning/end of the string). – Pedro Lobito Apr 25 '14 at 14:59
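
The effect of each modifier can be checked with a short experiment. This sketch assumes Python, where re.S, re.I and re.M correspond to the s, i and m modifiers described above; the test string is invented.

```python
import re

text = 'id="first"\nCLASS="multi\nline"'
pattern = r'(?:id|class)="(.*?)"'

# No flags: uppercase CLASS is not matched.
no_flags = re.findall(pattern, text)

# i only: CLASS now qualifies, but its value spans a newline that "." cannot cross.
i_only = re.findall(pattern, text, re.I)

# i + s: the dot matches newlines, so the two-line value is captured.
i_and_s = re.findall(pattern, text, re.I | re.S)

# Adding m changes nothing here because the pattern contains no ^ or $.
all_three = re.findall(pattern, text, re.I | re.S | re.M)

# m only matters once an anchor is involved.
anchored_plain = re.findall(r'^(?:id|class)=', text, re.I)
anchored_multi = re.findall(r'^(?:id|class)=', text, re.I | re.M)

print(no_flags, i_only, i_and_s, all_three)
print(anchored_plain, anchored_multi)
```

So for the pattern in this answer, i and s do the work, and m is harmless but has no effect.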
