I am new to regex and still trying to understand how it works. I am trying to write a regex that captures the name and value from an input tag in HTML.

<input type='hidden' name='student' value='9208'>

My idea is to extract the value of the name attribute (student) and the value of the value attribute (9208). I wrote the following regex based on an answer to an earlier Stack Overflow question.

/<(input)(?:\s+type=([\'"]?)(?<type>[^\'"]*?)\2\s*)?(?:\s+name=([\'"]?)(?<name>[^\'"]*?)\4\s*)?(?:\s+value=([\'"]?)(?<value>[^\'"]*?)\4\s*)?>/m

The above regex works with input like

<input type='hidden' name='student' value='9208'>

But it does not match when there are no single or double quotation marks around the value of the value attribute (as in value='9208'), e.g.:

<input type='hidden' name='student' value=9208>

In the above case it gave no matches at all. Can someone help me fix the regex? Thank you.

Shota
  • You don't generally want to use regex to parse HTML. Instead use an XML parser like [SimpleXML](http://php.net/manual/en/book.simplexml.php). – Alex Howansky Jan 10 '19 at 17:55
  • I will give it a try. Is there any particular reason for not using regex to parse HTML? It seems that it is possible. I am really new to this. – Shota Jan 10 '19 at 18:00
  • @Shota yes... https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags **TL;DR** *Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.* – Andreas Jan 10 '19 at 18:03
  • @Andreas Thanks, I got the idea. – Shota Jan 10 '19 at 18:06
  • If the use case is very simple, you can sometimes get by with regex and you'll be fine. However, HTML as a whole can not be handled by regex alone, and once you get past very simple patterns, it's just simpler and safer to use an XML parser. – Alex Howansky Jan 10 '19 at 18:07
  • @Alex Howansky I will give it a try. Thanks for the information – Shota Jan 10 '19 at 18:12
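
For comparison, the parser route suggested in these comments could look like the sketch below. It swaps in PHP's `DOMDocument` for SimpleXML, because the sample tag is HTML rather than well-formed XML. The function name `extractInputs` is hypothetical, and the sketch assumes the `dom` extension is available.

```php
<?php
// Hypothetical helper: collect the name/value pair of every <input> in an HTML fragment.
function extractInputs(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $inputs = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $inputs[] = [
            'name'  => $input->getAttribute('name'),
            'value' => $input->getAttribute('value'),
        ];
    }
    return $inputs;
}

// Works the same whether or not the attribute values are quoted.
print_r(extractInputs("<input type='hidden' name='student' value=9208>"));
```

With this approach the quoting style and the order of the attributes no longer matter, which is the main reason the comments recommend a parser.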

2 Answers

0

There is a small problem in your regex.

<(input)(?:\s+type=([\'"]?)(?<type>[^\'"]*?)\2\s*)?(?:\s+name=([\'"]?)(?<name>[^\'"]*?)\4\s*)?(?:\s+value=([\'"]?)(?<value>[^\'"]*?)\4\s*)?>

Notice that the value part, (?:\s+value=([\'"]?)(?<value>[^\'"]*?)\4\s*)?, uses \4 as its closing quote. But \4 refers to group 4, which is the opening quote captured in the name part, (?:\s+name=([\'"]?)(?<name>[^\'"]*?)\4\s*)?. So the regex matches only when the value attribute is quoted the same way as the name attribute. In your failing example, name is single-quoted while value has no quotes, so \4 demands a closing ' after 9208 that never appears, and the whole match fails.

The fix is to change \4 to \6 in the value part, so that it refers to the value attribute's own opening quote (group 6).

Here is the corrected regex:

<(input)(?:\s+type=([\'"]?)(?<type>[^\'"]*?)\2\s*)?(?:\s+name=([\'"]?)(?<name>[^\'"]*?)\4\s*)?(?:\s+value=([\'"]?)(?<value>[^\'"]*?)\6\s*)?>

Demo
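
To check the difference between the two backreferences locally, here is a minimal PHP sketch. The patterns are the ones quoted above, wrapped in `/` delimiters as in the question; the variable names are illustrative assumptions.

```php
<?php
// Original pattern from the question: the value part closes with \4 (the name quote).
$original = '/<(input)(?:\s+type=([\'"]?)(?<type>[^\'"]*?)\2\s*)?(?:\s+name=([\'"]?)(?<name>[^\'"]*?)\4\s*)?(?:\s+value=([\'"]?)(?<value>[^\'"]*?)\4\s*)?>/m';
// Corrected pattern: the value part closes with \6 (its own opening quote).
$corrected = '/<(input)(?:\s+type=([\'"]?)(?<type>[^\'"]*?)\2\s*)?(?:\s+name=([\'"]?)(?<name>[^\'"]*?)\4\s*)?(?:\s+value=([\'"]?)(?<value>[^\'"]*?)\6\s*)?>/m';

$quoted   = "<input type='hidden' name='student' value='9208'>";
$unquoted = "<input type='hidden' name='student' value=9208>";

foreach ([$quoted, $unquoted] as $html) {
    $oldHit = preg_match($original, $html);        // 1 on a match, 0 otherwise
    $newHit = preg_match($corrected, $html, $m);
    printf("%s\n  original: %d, corrected: %d", $html, $oldHit, $newHit);
    if ($newHit === 1) {
        printf(" (name=%s, value=%s)", $m['name'], $m['value']);
    }
    echo "\n";
}
```

The original pattern should match only the quoted tag, while the corrected one should match both and report student and 9208 each time.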

Pushpesh Kumar Rajwanshi
  • thank you very much for the support and it means a lot to me – Shota Jan 10 '19 at 18:44
  • Pleased to help brother :) – Pushpesh Kumar Rajwanshi Jan 10 '19 at 18:44
  • I ran into a small problem while using the above regex. When I run the following code on my local server, it does not work as expected. But it works properly when I check it on regular expressions 101. – Shota Jan 16 '19 at 03:20
  • `$html_1 =""; $re_1 = '<(input)(?:\s+type=([\'"]?)(?<type>[^\'"]*?)\2\s*)?(?:\s+name=([\'"]?)(?<name>[^\'"]*?)\4\s*)?(?:\s+value=([\'"]?)(?<value>[^\'"]*?)\6\s*)?>'; preg_match_all($re_1, $html_1, $matches_1); $var_1=sizeof($matches_1); $count_1 = count($matches_1[0]);` When I print the content of the $matches_1 array, the value fields are not included. But on the regex101 site, it shows that they are included. – Shota Jan 16 '19 at 03:27
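
One possible cause, offered as an assumption because the comment shows neither the input nor the output: the pattern string in that snippet has no delimiters of its own, and PHP accepts `<` and `>` as a bracket-style delimiter pair. The leading `<` and trailing `>` would then be read as delimiters rather than as part of the pattern, so nothing forces the match to run to the end of the tag, and the optional name and value parts can simply be skipped. The sketch below compares the bare pattern with one wrapped in `/` delimiters; the sample tag and the variable names are illustrative.

```php
<?php
$html = "<input type='hidden' name='student' value=9208>";

// The corrected pattern body from the answer, without any delimiters.
$body = '<(input)(?:\s+type=([\'"]?)(?<type>[^\'"]*?)\2\s*)?'
      . '(?:\s+name=([\'"]?)(?<name>[^\'"]*?)\4\s*)?'
      . '(?:\s+value=([\'"]?)(?<value>[^\'"]*?)\6\s*)?>';

// Passed as-is, PHP reads the outer < and > as the delimiters.
preg_match($body, $html, $bare);

// Wrapped in / delimiters, the < and > belong to the pattern again.
preg_match('/' . $body . '/', $html, $delimited);

var_dump($bare[0]);                    // text matched by the bare pattern
var_dump($bare['value'] ?? null);      // value as seen by the bare pattern
var_dump($delimited['value'] ?? null); // value as seen by the delimited pattern
```

If this is what happened, the bare pattern stops after the type attribute and leaves name and value empty, while the delimited one reports 9208. regex101 adds the delimiters itself, which would explain why the same pattern behaves differently there.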
0

I agree with the comments on your post that using regex to parse HTML is not a good idea. It is still possible, but it requires you to be very careful and precise.

In your case the regex can be as follows (for readability I divided it into chunks):

  • <(input) - < and the first capturing group, matching the tag name.
  • (?:\s+type=([\'"]?)(?<type>[^\'"]+)\2)? - The part for type attribute.
  • (?:\s+name=([\'"]?)(?<name>[^\'"]+)\4)? - The part for name attribute.
  • (?:\s+value=([\'"]?)(?<value>[^\'"]+)\6)? - The part for value attribute.
  • \s*> - A sequence of spaces and > terminating the tag.

The mistake is in the part for value: it backreferences group 4 (\4), but it should backreference group 6 (\6), the opening quote of value.

Another correction: since each attribute part starts with \s+, the preceding part does not need to end with \s* (as yours did).

For a working example see https://regex101.com/r/IOLKTV/1
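
The chunks listed above can be joined back into a single pattern. A minimal PHP sketch, assuming `/` delimiters and a hypothetical helper named `parseInputTag` (neither comes from the answer itself):

```php
<?php
// Join the five chunks from the list above into one delimited pattern.
$chunks = [
    '<(input)',                                   // tag name (group 1)
    '(?:\s+type=([\'"]?)(?<type>[^\'"]+)\2)?',    // type attribute (groups 2, 3)
    '(?:\s+name=([\'"]?)(?<name>[^\'"]+)\4)?',    // name attribute (groups 4, 5)
    '(?:\s+value=([\'"]?)(?<value>[^\'"]+)\6)?',  // value attribute (groups 6, 7)
    '\s*>',                                       // optional spaces and closing >
];
$pattern = '/' . implode('', $chunks) . '/';

// Hypothetical helper: returns the named groups of the first match, or null.
function parseInputTag(string $pattern, string $html): ?array
{
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    return [
        'type'  => $m['type']  ?? null,
        'name'  => $m['name']  ?? null,
        'value' => $m['value'] ?? null,
    ];
}

print_r(parseInputTag($pattern, "<input type='hidden' name='student' value=9208>"));
print_r(parseInputTag($pattern, '<input type="hidden" name="student" value="9208">'));
```

Keeping the chunks in an array makes it easier to see which backreference belongs to which attribute. The pattern still assumes the attributes appear in the order type, name, value, and an unquoted value can run past the closing > when several tags share one string, so it is only suitable for simple, known input.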

Valdi_Bo