I'm trying to match the <html> tag with optional attributes and to extract those attributes. I want to match any one of the following variations of the <html> tag. It would be at the start of an HTML document, possibly preceded by a DOCTYPE declaration.

<html>
<html lang="en">
<html class="my-class">
<html class="my-class" lang="en">

The regular expression I'm trying is below, but for the fourth case it only captures the last attribute, lang="en".

/<html(\s+([a-z\-]+)=('|")([^"'>]*)('|"))*>/i
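
The behaviour described above comes from how repeated capturing groups work: most regex engines, including PCRE and Python's `re`, keep only the text from the last repetition of a group. The following is a minimal sketch in Python (not the PHP used in the question) that reproduces the effect with the same pattern; the function name `last_repetition` is made up for illustration.

```python
import re

# Same pattern as in the question, compiled case-insensitively.
PATTERN = re.compile(r"""<html(\s+([a-z\-]+)=('|")([^"'>]*)('|"))*>""", re.IGNORECASE)

def last_repetition(tag):
    """Return (name, value) as captured by groups 2 and 4 (hypothetical helper)."""
    m = PATTERN.match(tag)
    if m is None:
        return None
    # Groups inside a repeated (...)* are overwritten on every repetition,
    # so only the final attribute is still available after the match.
    return (m.group(2), m.group(4))

print(last_repetition('<html class="my-class" lang="en">'))  # ('lang', 'en')
print(last_repetition('<html>'))                             # (None, None)
```

The tag as a whole still matches; only the per-attribute captures are lost, which is why extracting every attribute needs either repeated matching or a second pass over the tag.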

I know that some people suggest using a DOM parser instead of a regular expression, but I think a regular expression is enough for my case because I only want to match the <html> tag.

Sithu

1 Answer

Use the regex below, then read each attribute name from group 1 and its value from group 3 (group 2 holds the quote character).

(?:<html|(?<!^)\G)\h*(?:([^=\n\h]+)=(['"])((?:\\\2|(?!\2).)*)\2)?

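Here \G asserts the position where the previous match ended, so each successive match is meant to continue with the next attribute of the same tag.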
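
For comparison, here is a sketch of a different technique, not the `\G` approach above: a two-step extraction in Python, whose built-in `re` module supports neither `\G` nor `\h`. It first isolates the opening `<html ...>` tag and then scans only that tag's text for attributes, so attributes of other elements cannot be picked up. The names `HTML_TAG`, `ATTRIBUTE`, and `html_attributes` are made up for illustration, and the sketch assumes quoted attribute values that contain no `>`.

```python
import re

# Step 1: the opening <html ...> tag; group 1 is the raw attribute text.
HTML_TAG = re.compile(r"<html\b([^>]*)>", re.IGNORECASE)

# Step 2: one name="value" or name='value' pair.
# The backreference \2 requires the closing quote to match the opening one.
ATTRIBUTE = re.compile(r"""([^\s=]+)\s*=\s*(['"])(.*?)\2""", re.DOTALL)

def html_attributes(document):
    """Return [(name, value), ...] for the first <html> tag, or None if absent."""
    tag = HTML_TAG.search(document)
    if tag is None:
        return None
    # Only the text inside the <html ...> tag is scanned for attributes.
    return [(m.group(1), m.group(3)) for m in ATTRIBUTE.finditer(tag.group(1))]

doc = '<!DOCTYPE html>\n<html class="my-class" lang="en">\n<body id="main">'
print(html_attributes(doc))       # [('class', 'my-class'), ('lang', 'en')]
print(html_attributes('<html>'))  # []
```

A bare `<html>` yields an empty list rather than no match, which covers the first case in the question.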

Avinash Raj
  • Thanks. But I want to match `<html>` without attributes (the first case) too. I especially want to manipulate the values of the `class` and `lang` attributes and leave the other attributes as they are. If there is a `class`, I want to append to its value; if there is no `class`, I want to add one. – Sithu Jan 17 '15 at 05:13
  • Sir, how can I become a regex master like you? Please give me some advice. @AvinashRaj – A l w a y s S u n n y Jan 17 '15 at 05:16
  • @AvinashRaj Awesome! Upvoted. It is better to extract the attribute name and value separately, so that further processing of the result is easy and needs no string manipulation. – Sithu Jan 17 '15 at 05:20
  • @BeingHuman It's simple. I started to learn regex (80%) about 10 months ago. This site taught me regex. People here are really awesome. I learned a few things from the regex gurus on SO, and some people cleared up my doubts. I want to become one of those people. Yes, you can ask me any regex questions by commenting below my posts. – Avinash Raj Jan 17 '15 at 05:24
  • @Sithu Get the attribute name from group 1 and the value from group 3: https://regex101.com/r/kG5vF1/4 – Avinash Raj Jan 17 '15 at 05:27
  • @AvinashRaj Can you please edit your answer with the latest update? I will accept it. BTW, what is the purpose of using `~` instead of `/`? – Sithu Jan 17 '15 at 05:31
  • @AvinashRaj Thanks for the reply. I'll comment on your next regex posts to clear up my doubts. – A l w a y s S u n n y Jan 17 '15 at 07:17
  • @AvinashRaj Although it seems to work at [regex101.com](https://regex101.com/r/kG5vF1/4), unfortunately it does not work in practice. It matches attributes that are not related to `<html>`. [Please check my PhpFiddle example](http://phpfiddle.org/lite/code/2dc507f309222bf89fb3) – Sithu Jan 30 '15 at 17:44
  • @Sithu The input text in this question and the input on PhpFiddle are different. Please ask this as a new question. – Avinash Raj Jan 31 '15 at 00:54
  • @AvinashRaj Sorry for the confusion in my question. The input in my question was meant to show the variations of `<html>` to match; one of them would be at the start of an HTML document. I did not include the full HTML content here, which may have caused the misunderstanding. My goal is what is shown on [PhpFiddle.com](http://phpfiddle.org/lite/code/2dc507f309222bf89fb3). When I accepted your answer, it was working on regex101.com, but it behaves slightly differently when I use your pattern with `preg_match_all`. I will not ask a duplicate question, but I will improve my question to avoid misunderstanding. – Sithu Jan 31 '15 at 06:11
  • @AvinashRaj Also, a lot of whitespace is being matched and ends up in `$matches`. – Sithu Jan 31 '15 at 06:24
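
The goal described in the comments (append to `class` on `<html>` if it is present, add it if it is not, and leave every other attribute alone) could be sketched as follows. This is an illustrative Python sketch rather than the PHP from the fiddle, it does not use the accepted pattern, and the names `HTML_TAG`, `CLASS_ATTR`, and `add_html_class` are made up. It assumes quoted attribute values with no `>` inside them and does not escape `new_class`.

```python
import re

# Group 1: "<html" as written in the document; group 2: the raw attribute text.
HTML_TAG = re.compile(r"(<html\b)([^>]*)>", re.IGNORECASE)

# A class attribute that is not the tail of another name such as data-class.
CLASS_ATTR = re.compile(
    r"""(?<![\w-])(class\s*=\s*)(['"])(.*?)\2""", re.IGNORECASE | re.DOTALL
)

def add_html_class(document, new_class):
    """Append new_class to the first <html> tag's class, or add the attribute."""
    def rewrite_tag(tag):
        attrs = tag.group(2)

        def append(m):
            # Keep the existing value and quoting style, then add the new class.
            value = (m.group(3) + " " + new_class).strip()
            return m.group(1) + m.group(2) + value + m.group(2)

        new_attrs, count = CLASS_ATTR.subn(append, attrs, count=1)
        if count == 0:
            # No class attribute yet: insert one before the other attributes.
            new_attrs = ' class="%s"' % new_class + attrs
        return tag.group(1) + new_attrs + ">"

    # Only the first <html ...> tag is rewritten; other elements are untouched.
    return HTML_TAG.sub(rewrite_tag, document, count=1)

print(add_html_class('<html class="my-class" lang="en">', "js"))
# <html class="my-class js" lang="en">
print(add_html_class('<html lang="en">', "js"))
# <html class="js" lang="en">
```

Because the substitution is confined to the text of the `<html>` tag, a `class` on `<body>` or any later element is never modified.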