
I'm trying to analyze HTML code and extract all CSS classes and IDs from the source. So I need to extract whatever is between two quotation marks when they are preceded by either class or id:

id="<extract this>"

class="<extract this>"
eveo
  • Use an HTML parser. Don't use regular expressions. – Sean Bright Apr 24 '14 at 18:07
  • This is the compulsory comment reminding you that you should be using an XML/HTML parser not regex for HTML. – Etheryte Apr 24 '14 at 18:07
  • Whatever programming language you are using, be sure to use a parser and not regex. – hwnd Apr 24 '14 at 18:07
  • Thank you for your suggestions, but if I wanted to use an HTML Parser, I would have posted that instead. I simply need to extract any classes and ID's from a page, that's all. I'm organizing stylesheets so I want a list of classes and ID's used in the plain HTML source before it gets compiled and jQuery Mobile blows it up with its own custom classes. – eveo Apr 24 '14 at 18:10
  • Might be related to: http://stackoverflow.com/a/1732454/464257 – Shaz Apr 24 '14 at 18:14
  • Why did you link that @Shaz – eveo Apr 24 '14 at 18:15
  • Which language are you using ? – Pedro Lobito Apr 24 '14 at 18:29

3 Answers

/(?:id|class)="([^"]*)"/gi

replacement expression: $1

This regex in English: match either "id" or "class", then an equals sign and a quote, then capture everything that is not a quote before matching the closing quote. Do this globally and case-insensitively.
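
A minimal sketch of how the capture group might be used, assuming PHP's preg_match_all (the answer itself is language-neutral, and the sample markup and variable names below are made up for illustration). The full match includes the attribute name and quotes, while group 1 holds only the value:

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<div id="main" class="box wide"><p CLASS="note">hi</p></div>';

// PHP has no "g" flag: preg_match_all() already finds every match.
// The "i" flag keeps the match case-insensitive.
preg_match_all('/(?:id|class)="([^"]*)"/i', $html, $matches);

// $matches[0] holds the full matches, e.g. id="main".
// $matches[1] holds only what group 1 captured, e.g. main.
print_r($matches[0]);
print_r($matches[1]);
```

A tester that highlights the whole match will show id="main"; the extracted value is whatever the tool reports for group 1.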

Tim Pietzcker
Pat Newell
  • nice @Tim! those regexes... they'll get you every time. – Pat Newell Apr 24 '14 at 18:15
  • I inputted this on http://www.regexr.com/, along with an HTML page at the bottom and it matches the entire "id='id'" instead of just id. Can you verify? http://cl.ly/image/18363j1w1g1V – eveo Apr 24 '14 at 18:21

Since you prefer to use a regular expression, here is one way to do it.

\b(?:id|class)\s*=\s*"([^"]*)"

Regular expression:

\b             # the boundary between a word char (\w) and not a word char
(?:            # group, but do not capture:
  id           # 'id'
 |             # OR
  class        # 'class'
)              # end of grouping
\s*            # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
=              # '='
\s*            # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
"              # '"'
(              # group and capture to \1:
  [^"]*        # any character except: '"' (0 or more times)
)              # end of \1
"              # '"'
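
A sketch of this pattern in use, assuming PHP's preg_match_all and a made-up sample string (all variable names are illustrative). The \s* parts tolerate spaces around the equals sign, and \b keeps an attribute name such as myid from matching; note that \b still matches after a hyphen, so data-id would be picked up. Splitting each captured value on whitespace yields individual names, with ids and classes collected into one list:

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<ul class = "nav  nav-top" id="menu" myid="skip"><li class="nav item"></li></ul>';

preg_match_all('/\b(?:id|class)\s*=\s*"([^"]*)"/i', $html, $matches);

$names = [];
foreach ($matches[1] as $value) {
    // A class attribute may list several names separated by whitespace.
    foreach (preg_split('/\s+/', trim($value), -1, PREG_SPLIT_NO_EMPTY) as $name) {
        $names[$name] = true; // array keys de-duplicate the names
    }
}

print_r(array_keys($names)); // nav, nav-top, menu, item
```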
hwnd

You may want to try this:

<?php

$css = <<< EOF
id="<extract this>"
class="<extract this>"id="<extract this2>"
class="<extract this3>"id="<extract this4>"
class="<extract this5>"id="<extract this6>"
class="<extract this7>"id="<extract this8>"
class="<extract this9>"
EOF;

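// Capture the quoted value of every id="..." or class="..." attribute.
// $classes[0] holds the full matches; $classes[1] holds the values captured by group 1.
preg_match_all('/(?:id|class)="(.*?)"/sim', $css, $classes, PREG_PATTERN_ORDER);
foreach ($classes[1] as $value) {
    echo $value . "\n";
}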
    /*
    <extract this>
    <extract this>
    <extract this2>
    <extract this3>
    <extract this4>
    <extract this5>
    <extract this6>
    <extract this7>
    <extract this8>
    <extract this9>
    */
?>

DEMO:
http://ideone.com/Nr9FPt

Pedro Lobito
  • Exactly what I wanted. I just threw my giant HTML page into the CSS variable, ran it, and it neatly printed every ID and class on that HTML page. Thank you! – eveo Apr 24 '14 at 18:37
  • Tuga, what does the /sim mean? – eveo Apr 25 '14 at 14:53
  • `s` modifier: single line. Dot matches newline characters. `i` modifier: case insensitive (ignores case of [a-zA-Z]). `m` modifier: multi-line. Causes ^ and $ to match the beginning/end of each line (not only the beginning/end of the string). – Pedro Lobito Apr 25 '14 at 14:59
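
To illustrate the `s` modifier described above, here is a small sketch with a made-up attribute value that spans two lines (the sample string and variable names are assumptions for illustration). The `m` modifier makes no difference to this pattern, since it contains neither ^ nor $:

```php
<?php
// Hypothetical attribute value that spans two lines.
$html = "<div class=\"one\ntwo\"></div>";

// Without "s", the dot stops at the newline, so nothing matches.
$without = preg_match_all('/class="(.*?)"/i', $html, $a);

// With "s", the dot also matches the newline.
$with = preg_match_all('/class="(.*?)"/is', $html, $b);

var_dump($without); // int(0)
var_dump($with);    // int(1)
```

A negated class such as [^"]* matches newlines regardless of `s`, so the patterns in the other answers do not need the modifier.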