I want to extract the class and data-* attribute values from HTML with preg_match_all.

I asked a similar question before, and the accepted answer there used DOM. As an alternative to the DOM approach, I also need a regex version.

The pattern mostly works. However, when tags sit on the same line, it also captures class names from tags that should not be matched.

<div class="noproblem"> 
    <ul class="noproblem" data-ss="1">
        <li class="noproblem" data-ss="1">
            <!-- <i> is not my tag. but there s no problem with that. because it s underneath . -->
            <i class="no_problem"></i>
        </li>
    </ul>
</div>

<div class="noproblem" data-ss"1">  <!-- problem: data-ss is not accepted -->
    <ul class="noproblem" data-ss="1">
        <!-- <i> is not my tag. my tags:  div|ul|li . -->
        <li class="noproblem"><i class="this_is_problem"></i>
        </li>
    </ul>
</div>

<div class="noproblem">
    <ul class="noproblem">
        <!-- <i> is not my tag. my tags:  div|ul|li . -->
        <li class="noproblem"><i class="this_is_problem"></i>
        </li>
        <!-- <span> is not my tag. my tags:  div|ul|li . -->
        <li class="test"><span class="this_is_problem"></span></li>
        <!-- (li class empty version): <span> is not my tag. my tags:  div|ul|li . -->
        <li><span class="this_is_problem"></span></li>
    </ul>
</div>

Regex pattern:

$pattern = '/<(?:div|ul|li)(?:.*?(?:class|data-ss)="([^"]+)")?(?:.*?(?:class|data-ss)="([^"]+)")?[^>]*>/'; 
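The same-line problem can be reproduced with a short script. This is a minimal sketch: the sample string is a hypothetical one-line excerpt modelled on the HTML above, and the pattern is copied unchanged. Because `.` also matches `>`, the lazy `.*?` in the second optional group walks out of the `<li>` tag and picks up the class of the following `<i>`.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '/<(?:div|ul|li)(?:.*?(?:class|data-ss)="([^"]+)")?(?:.*?(?:class|data-ss)="([^"]+)")?[^>]*>/';

// Hypothetical one-line sample: <li> and <i> sit on the same line.
$sameLine = '<li class="noproblem"><i class="this_is_problem"></i>';

preg_match($pattern, $sameLine, $m);

// Group 1 holds the class of <li>, as intended.
var_dump($m[1]); // string(9) "noproblem"

// Group 2 holds the class of <i>, which should not have been captured.
var_dump($m[2]); // string(15) "this_is_problem"
```

When the `<i>` is on its own line the second group finds nothing, because `.` does not match a newline without the `s` modifier.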

Examples and problems: https://regex101.com/r/vSIsac/5

Alternative source (my old question): https://stackoverflow.com/a/51778865/6320082

VLAZ
Mert Aşan
  • This question needs to stand on its own merits. You should really put the relevant code and a clear question here, instead of linking to old/other questions asking for a different thing based on them – James Aug 10 '18 at 22:19
  • I understand, but the question in the link I refer to is also mine. I did not want to ask the same question again and flood it. That's why I gave it as a reference url. I clearly defined the subject there. Sorry. :) – Mert Aşan Aug 10 '18 at 22:24
  • @MertA. Every question and answer on SO is meant to serve the community as a whole, not just you as the original poster. We have some rules in place to keep the quality high, which means that every question should stand on its own and should be answerable without visiting any links. (The links can be there as a reference.) – Ivar Aug 10 '18 at 22:31
  • Why do you need regex as an alternative for a DOM parser? – Ivar Aug 10 '18 at 22:32
  • "*I did not want to ask the same question again / the question in the link I refer to is also mine*" either this question is a dupe of the other one and thus needs closing as a dupe, or this is a *new* question and thus should stand on its own merits and information. Consider what happens if your other question gets closed/deleted in the future: how would this question then be of any use to anyone? – James Aug 10 '18 at 22:40
  • Your problem is trying to use regular expressions to parse HTML. ***Use a parser like DOMDocument.*** – Sammitch Aug 10 '18 at 22:45
  • @Ivar You are both right. Sorry. I updated the question. Thank you for the warning. – Mert Aşan Aug 10 '18 at 22:49
  • @James You are both right. Sorry. I updated the question. Thank you for the warning. – Mert Aşan Aug 10 '18 at 22:49
  • @Ivar Because the DOM approach cannot reach everything inside the page; I could not make it work. I am building a CSS and HTML class/id encryption project in PHP (it is also compatible with jQuery). Screenshots: url1: https://ibb.co/bUpgJ9 url2: https://ibb.co/gqO8y9 – Mert Aşan Aug 10 '18 at 22:49
  • That's not encryption, it's obfuscation, and why on earth would you want to do that? – Sammitch Aug 11 '18 at 00:00
  • @Sammitch Because I built a very expensive piece of custom software, and most of its code runs in JavaScript/jQuery. I'm trying to make it as hard to read as possible. This method (obfuscation) is also used on Facebook and WhatsApp Web. (I'm using Google Translate, so sorry for any wording errors.) – Mert Aşan Aug 11 '18 at 00:06

1 Answer

If you really need to use regexes, try this:

<(?:div|ul|li)(?=[^>]*\bclass="([^"]+)")(?=(?:[^>]*\bdata-\w+="([^"]+)")?)

You'll get the class value in the first capturing group ($1) and the data value (if it exists) in the second capturing group ($2).

Explained:

<(?:div|ul|li)  # div or ul or li tag

 # Lookahead expressions:

 # match any character except '>' any number of times, then class
 (?= # lookahead
    [^>]*\bclass="([^"]+)"
 )  

 # match any character except '>' any number of times, then data
 # Since this is optional, we make the whole expression optional with ?
 (?=
    (?:
        [^>]*\bdata-\w+="([^"]+)"
    )? # optional
 )
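
For use with preg_match_all, the pattern above can be applied roughly as follows. This is a sketch rather than a definitive implementation: the `~` delimiters, the constant `TAG_PATTERN`, the helper `extractClassAndData`, and the sample HTML are all assumptions made for illustration. Because `[^>]*` cannot cross a `>`, each lookahead stays inside the tag it started in, so the `<i>` and `<span>` classes are ignored even when they share a line with an `<li>`.

```php
<?php
// The answer's pattern, wrapped in ~ delimiters (a choice made for this sketch).
const TAG_PATTERN = '~<(?:div|ul|li)(?=[^>]*\bclass="([^"]+)")(?=(?:[^>]*\bdata-\w+="([^"]+)")?)~';

// Hypothetical helper: returns one class/data pair per matched tag.
function extractClassAndData(string $html): array
{
    preg_match_all(TAG_PATTERN, $html, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        // Group 2 is missing (or empty) when the tag has no data-* attribute.
        $data = (isset($m[2]) && $m[2] !== '') ? $m[2] : null;
        $result[] = ['class' => $m[1], 'data' => $data];
    }
    return $result;
}

// Hypothetical sample with everything on one line.
$html = '<ul class="list" data-ss="1"><li class="item"><i class="icon"></i></li>'
      . '<li><span class="label"></span></li></ul>';

print_r(extractClassAndData($html));
```

With this sample, only the `<ul>` and the first `<li>` are reported. The second `<li>` has no class attribute, so the first lookahead fails and the tag is skipped entirely.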
Julio
  • Thank you. For those who have the same problem, both this answer and the linked answer are correct. https://stackoverflow.com/questions/51778425/regex-to-find-html-div-class-content-and-data-attr-preg-match-all/51778865?noredirect=1#comment90547307_51778865 – Mert Aşan Aug 10 '18 at 23:26
  • I have completed my project with your help. I will post it on GitHub soon. Thank you again :) Example: https://ibb.co/iiQr1U – Mert Aşan Aug 10 '18 at 23:34
  • How do I get output with $0 (full match)? Examples (in loop): $0 output: $0 output: – Mert Aşan Aug 11 '18 at 00:50
  • Just add `[^>]+` at the end of the regex. Like this: `<(?:div|ul|li)(?=[^>]*\bclass="([^"]+)")(?=(?:[^>]*\bdata-\w+="([^"]+)")?)[^>]+` – Julio Aug 11 '18 at 00:55
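
A short sketch of what the extended pattern from the last comment returns as the full match. The `~` delimiters, variable names, and sample string are assumptions for illustration. Note that `[^>]+` stops before the closing bracket, so `$0` holds the tag text without the final `>`.

```php
<?php
// The answer's regex plus the trailing [^>]+ suggested in the comment.
$fullPattern = '~<(?:div|ul|li)(?=[^>]*\bclass="([^"]+)")(?=(?:[^>]*\bdata-\w+="([^"]+)")?)[^>]+~';

// Hypothetical sample input.
$sample = '<div class="box" data-ss="1"><i class="icon"></i></div>';

preg_match_all($fullPattern, $sample, $found, PREG_SET_ORDER);

foreach ($found as $m) {
    // $m[0] is the full match: the tag up to, but not including, '>'.
    echo $m[0], PHP_EOL; // <div class="box" data-ss="1"
}
```

If the closing bracket is wanted as well, appending a literal `>` after `[^>]+` should include it in the full match; that variation is not part of the original comment.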