1

I have an HTML file and I want to collect all the class names from it into an array using PHP. For example, this is my HTML file:

<div class="main menu">element</div>
<div class="content"></div>

I want to get an array with three elements (in this particular example): "main", "menu", "content".

In bash it is possible to use grep to accomplish this:

classes=($(grep -oP '(?<=class=").*?(?=")' "./index.html"))

How can I do the same in PHP?

This is the basic code I have at the moment:

//read the entire file into a string
$str = file_get_contents('./index.html');
$fp = fopen('./index.html', 'w');
//Here I guess should be the function to get all of the strings
//now, save the file
fwrite($fp, $str);
fclose($fp);

Edit: How can my question be a duplicate of the one linked, when I am asking how to find the strings using PHP? This is not about bash, and I have already provided the grep equivalent myself.

Alex

3 Answers

4

I would use PHP's DOMDocument class like this:

$classes = array();
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTMLFile('./index.html');
$elements = $dom->getElementsByTagName('*');
foreach($elements as $element) {
    $classes = array_merge($classes,array_filter(explode(' ',$element->getAttribute('class'))));
}
print_r($classes);

Explanation:

  • declare an empty array $classes
  • suppress the libxml warnings that DOMDocument would otherwise emit for incomplete or invalid HTML
  • instantiate a new DOMDocument object
  • load the file index.html into the DOMDocument
  • get all elements using the wildcard tag name
  • iterate over the elements
  • get each element's class attribute
  • explode the attribute value on spaces
  • filter the exploded array to remove empty values
  • merge the result into the $classes array
  • I agree, generally regex is not the appropriate means for parsing HTML :] It depends on whether you are parsing your own or arbitrary HTML and what you are trying to achieve, IMHO. – Jonny 5 Aug 17 '15 at 04:53
  • Well, thank you for the answer. I did some research and found it pretty useful. I will explore this topic more. Since I have already accepted an answer, the only thing I can do is upvote you. Thank you. – Alex Aug 17 '15 at 05:05
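The DOMDocument steps from this answer can also be wrapped in a reusable function. This is only a sketch under a few assumptions: the helper name extractClassNames is hypothetical, it takes the markup as a string (loadHTML instead of loadHTMLFile) so it runs without a file on disk, it splits on any whitespace with preg_split rather than explode, and it removes duplicate names, which the original answer does not do.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the unique
// class names found in an HTML string, in order of first appearance.
function extractClassNames(string $html): array
{
    // Collect libxml warnings internally instead of printing them,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $classes = [];
    foreach ($dom->getElementsByTagName('*') as $element) {
        // getAttribute() returns '' when the attribute is missing, and
        // PREG_SPLIT_NO_EMPTY then yields an empty array for that element.
        $names = preg_split('/\s+/', $element->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($names as $name) {
            $classes[] = $name;
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($classes));
}

$html = '<div class="main menu">element</div><div class="content main"></div>';
print_r(extractClassNames($html)); // main, menu, content
```

Splitting on `\s+` also covers class lists separated by tabs or line breaks, which a plain explode(' ') would leave joined.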
4

To get the three elements, try a regex like this with the preg_match_all function:

(?:class="|\G(?!^))\s*\K[^\s"]+
  • \G continues at end of the previous match or start
  • \K resets beginning of the reported match

See test at eval.in

if(preg_match_all('/(?:class="|\G(?!^))\s*\K[^\s"]+/', $str, $out) > 0)
  print_r($out[0]);

Array ( [0] => main [1] => menu [2] => content )

Note that regex is generally not the appropriate means for parsing HTML. Whether it is acceptable depends on whether you are parsing your own or arbitrary HTML and on what you are trying to achieve, IMHO.
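To make that caveat concrete, here is a small sketch comparing the pattern above with DOMDocument. The input is an assumption made up for illustration: the first div uses single quotes around its class attribute, which is valid HTML but does not match the literal class=" in the pattern.

```php
<?php
// Sample input (an assumption): the first attribute uses single quotes.
$html = "<div class='main menu'>element</div>\n<div class=\"content\"></div>";

// The pattern from this answer only recognises class=" with a double quote.
preg_match_all('/(?:class="|\G(?!^))\s*\K[^\s"]+/', $html, $out);
$fromRegex = $out[0];

// DOMDocument parses the markup, so the quoting style does not matter.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$fromDom = [];
foreach ($dom->getElementsByTagName('*') as $element) {
    foreach (preg_split('/\s+/', $element->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY) as $name) {
        $fromDom[] = $name;
    }
}

print_r($fromRegex); // content
print_r($fromDom);   // main, menu, content
```

For markup you control and always write with double quotes, the regex is enough; for arbitrary input, the parser is the safer choice.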

Jonny 5
  • parsing html with regex is foolish imo. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. Use DOMDocument or you may anger the gods – But those new buttons though.. Aug 17 '15 at 04:51
  • Thanks, but there are many regexers who are much better; you would only have needed to add the 'regex' tag :] Also, regex is generally not recommended for parsing arbitrary HTML. – Jonny 5 Aug 17 '15 at 04:51
  • What is the difference whether it is HTML or not? I am editing it as a text file. I will perform similar operations on files that have other extensions. – Alex Aug 17 '15 at 04:54
  • @Alex - you asked about getting classnames from an html document, which is parsing html. there are tools for this - namely `DOMDocument`. Learn it, use it. Read the link i posted above - regex is great, but not the right tool for the job. – But those new buttons though.. Aug 17 '15 at 04:57
1

Depending on what you're trying to do, you can either use regular expressions via the preg_grep function, or traverse the DOM using the DOMDocument class.
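A minimal sketch of the regular-expression route, with some assumptions: the $lines array stands in for what file('./index.html') would return, and the patterns only handle double-quoted class attributes. Note that preg_grep on its own only filters array entries that match a pattern; pulling the names out of the matching lines still needs preg_match_all.

```php
<?php
// Stand-in for file('./index.html'): the sample markup, one line per entry.
$lines = [
    '<div class="main menu">element</div>',
    '<p>no class here</p>',
    '<div class="content"></div>',
];

// preg_grep() keeps only the entries that match, preserving their keys.
$withClass = preg_grep('/class="[^"]*"/', $lines);

$classes = [];
foreach ($withClass as $line) {
    // Capture the value of every double-quoted class attribute on the line.
    preg_match_all('/class="([^"]*)"/', $line, $matches);
    foreach ($matches[1] as $value) {
        // Split the attribute value on whitespace, dropping empty pieces.
        $classes = array_merge($classes, preg_split('/\s+/', $value, -1, PREG_SPLIT_NO_EMPTY));
    }
}

print_r($classes); // main, menu, content
```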

Jesse Weigert