How can I detect a missing space between attributes? Example:

 <div style="margin:37px;"/></div>
 <span title=''style="margin:37px;" /></span>
 <span title="" style="margin:37px;" /></span>
 <a title="u" hghghgh  title="j" >

 <a title=""gg  ff>

Lines 1, 3 and 4 are correct; lines 2 and 5 are incorrect. How can I detect the incorrect ones?

I've tried with this:

<(.*?=(['"]).*?\2)([\S].*)|(^/)>

But it's not working.

Phiter
wroe12

4 Answers

You should not use regex to parse HTML, except for learning purposes.


http://regexr.com/3cge1

<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>

This regular expression matches a tag even if it has no attributes at all. It also works for self-closing tags and for attributes that have no value.


  • <\w+ matches the opening < followed by the tag name (\w characters).

  • (\s+[\w-]+(=(['"]?)[^"']*\3)?)* matches zero or more attributes, each of which must start with white space. It contains two parts:

    • \s+[\w-]+ the attribute name after the mandatory white space
    • (=(['"]?)[^"']*\3)? the optional attribute value; \3 requires the closing quote to match the opening one
  • \s*/?> optional white space and an optional / followed by the closing >.
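
As a rough illustration of the breakdown above, the pattern can be assembled from one piece per bullet. The constant names and the `isWellFormedTag` helper are hypothetical, and the `^`/`$` anchors are an addition so that exactly one tag is tested; treat this as a sketch, not as the answer's own code.

```javascript
// Hypothetical names; each constant mirrors one bullet of the breakdown.
const TAG_OPEN = /<\w+/.source;                      // opening < and the tag name
const ATTR_NAME = /\s+[\w-]+/.source;                // mandatory white space, then the name
const ATTR_VALUE = /(?:=(['"]?)[^"']*\1)?/.source;   // optional value; \1 repeats the opening quote
const TAG_CLOSE = /\s*\/?>/.source;                  // optional white space, optional /, then >

// Assumption: anchoring with ^ and $ so the whole string must be a single tag.
const validTag = new RegExp(
  '^' + TAG_OPEN + '(?:' + ATTR_NAME + ATTR_VALUE + ')*' + TAG_CLOSE + '$'
);

function isWellFormedTag(tag) {
  return validTag.test(tag);
}

console.log(isWellFormedTag('<div style="margin:37px;"/>'));              // true
console.log(isWellFormedTag('<span title=\'\'style="margin:37px;" />'));  // false
console.log(isWellFormedTag('<a title=""gg  ff>'));                       // false
```

Because the groups other than the quote are non-capturing here, the backreference is `\1` rather than `\3`; the matching behaviour is otherwise intended to be the same.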


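Here is a test against the strings from the question. Each expression negates the match result, so true means the tag did not match and is therefore incorrect: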

var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

! '<div style="margin:37px;"/></div>'.match(re);
false

! '<span title=\'\'style="margin:37px;" /></span>'.match(re);
true

! '<span title="" style="margin:37px;" /></span>'.match(re);
false

! '<a title="u" hghghgh  title="j" >'.match(re);
false

! '<a title=""gg  ff>'.match(re);
true

Display all incorrect tags:

var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
var tagRegex = /<\w+[^>]*\/?>/g;
var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

html.match(tagRegex).forEach(function(m) {
  if(!m.match(validRegex)) {
    console.log('Incorrect', m);
  }
});

This will output:

Incorrect <span title=''style="margin:37px;"/>
Incorrect <a title=""gg ff/>
Incorrect <a title="u" hghghgh title="j"example>
Incorrect <a title=""gg ff>

Update, following the comments:

<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>
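
The comments that prompted this update are not reproduced here, so the following is only an observation about one way the two patterns differ: the first one excludes both quote characters from the value, while the updated one handles each quote style separately. The anchors and the sample string are assumptions added for the demonstration.

```javascript
// Both patterns are copied from the answer, with ^ and $ added (an assumption)
// so that a single tag is tested as a whole.
const original = /^<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>$/;
const updated = /^<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*\/?>$/;

// Hypothetical sample: an apostrophe inside a double-quoted value.
const sample = '<a title="it\'s">';

console.log(original.test(sample)); // false: [^"']* cannot cross the apostrophe
console.log(updated.test(sample));  // true: ="[^"]*" only excludes the double quote

// The missing-space case from the question is still rejected by the updated pattern.
console.log(updated.test('<a title=""gg  ff>')); // false
```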
Community
sina

Try this regex; I think it will work:

<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>

< - the opening bracket

\w* - zero or more alphanumeric characters

[^=]*= - covers all characters until '=' shows up

["'][\w;:]*["'] - matches a value in either single or double quotes; the string inside the quotes is optional

[\s/]+ - matches white space or '/', at least one occurrence

[^>]* - matches all characters up to the closing bracket '>'
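
A minimal sketch of how this pattern might be applied to the five strings from the question. It assumes that a tag which does not match is the one to report as incorrect; the variable names are hypothetical.

```javascript
// Pattern copied unchanged from this answer (no g flag, so test() is stateless).
const pattern = /<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>/;

const samples = [
  '<div style="margin:37px;"/></div>',
  '<span title=\'\'style="margin:37px;" /></span>',
  '<span title="" style="margin:37px;" /></span>',
  '<a title="u" hghghgh  title="j" >',
  '<a title=""gg  ff>',
];

// Assumption: no match means the first attribute is not followed by a space or '/'.
const incorrect = samples.filter((s) => !pattern.test(s));

console.log(incorrect.join('\n'));
// <span title=''style="margin:37px;" /></span>
// <a title=""gg  ff>
```

Note that, as written, the pattern only inspects the gap after the first attribute value, and the value itself is limited to the characters in [\w;:].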

Mardzis
Khan

I got this pattern to work, finding incorrect lines 2 and 5 as you requested:

>>> import re
>>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

>>> html = """
 <div style="margin:37px;"/></div>
 <span title=''style="margin:37px;" /></span>
 <span title="" style="margin:37px;" /></span>
 <a title="u" hghghgh  title="j" >

 <a title=""gg  ff>
"""

>>> bad = re.findall(p, html)
>>> print('\n'.join(bad))
<span title=''style="margin:37px;" /></span>
<a title=""gg  ff>

The regex, broken down:

p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

< - starting bracket

[a-z]+\s - 1 or more lowercase letters followed by a space

[a-z]+= - 1 or more lowercase letters followed by an equals sign

[\'\"] - match a single or double quote one time

[\w;:]* - match a word character (a-zA-Z0-9_), a colon or a semicolon 0 or more times

[\"\'] - again match a single or double quote one time

[\w]+ - match a word character one or more times (this catches the lack of a space you wanted to detect)

.* - match anything 0 or more times (gets the rest of the line)
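
For comparison with the JavaScript snippets elsewhere on this page, the same pattern could be ported roughly as follows. This is a sketch under the assumption that only ASCII input is involved; the `g` flag stands in for `re.findall`, and the variable names are hypothetical.

```javascript
// Port of the Python pattern above; the g flag makes match() return every hit.
const pattern = /<[a-z]+\s[a-z]+=['"][\w;:]*["'][\w]+.*/g;

const html = [
  ' <div style="margin:37px;"/></div>',
  ' <span title=\'\'style="margin:37px;" /></span>',
  ' <span title="" style="margin:37px;" /></span>',
  ' <a title="u" hghghgh  title="j" >',
  '',
  ' <a title=""gg  ff>',
].join('\n');

// match() returns null when nothing is found, so fall back to an empty array.
const bad = html.match(pattern) || [];

console.log(bad.join('\n'));
// <span title=''style="margin:37px;" /></span>
// <a title=""gg  ff>
```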

Totem

I'm not sure about this, as I am not very experienced with regex, but this looks like it works well:

JS Fiddle

<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))?

Currently <([a-z]+) will work for most tags, but for web components and <ng-* elements [\w-]+ would be better, since the tag names contain hyphens and \w alone does not match them.

---------------

Output:

<div style="margin:37px;">div</div> correct

<span title=" style="margin:37px;" />span1</span> incorrect

<span title="" style="margin:37px;" />span2</span> correct

<a title="u" title="j">link</a> correct

<a title=""href="" alt="" required>test</a> incorrect

<img src="" data-abc="" required> correct

<input type=""style="" /> incorrect
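
A sketch of how this check could be run in JavaScript. Inside a JavaScript pattern a backreference is written `\1` (`$1` is only meaningful in replacement strings), so the version below uses `\1`; it also assumes `[\w-]+` for the tag name and adds `^`/`$` anchors. The `<my-widget>` sample and the variable names are hypothetical.

```javascript
// Assumptions: \1 as the backreference, [\w-]+ for the tag name, ^ and $ added.
const tagPattern = /^<([\w-]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))?$/;

const samples = [
  '<div style="margin:37px;">div</div>',
  '<span title="" style="margin:37px;" />span2</span>',
  '<a title="u" title="j">link</a>',
  '<a title=""href="" alt="" required>test</a>',
  '<img src="" data-abc="" required>',
  '<input type=""style="" />',
  '<my-widget data-x="1">hi</my-widget>', // hypothetical custom element
];

const results = samples.map((s) => tagPattern.test(s));

samples.forEach((s, i) => {
  console.log(s, results[i] ? 'correct' : 'incorrect');
});
```

Like the pattern it is based on, this only accepts double-quoted values and lowercase attribute names.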
Mi-Creativity