Regex for no space between attributes html

Question

How can I detect a missing space between attributes? Example:

 <div style="margin:37px;"/></div>
 <span title=''style="margin:37px;" /></span>
 <span title="" style="margin:37px;" /></span>
 <a title="u" hghghgh  title="j" >

 <a title=""gg  ff>

Lines 1, 3, and 4 are correct; lines 2 and 5 are incorrect. How can I detect the incorrect ones?

I've tried with this:

<(.*?=(['"]).*?\2)([\S].*)|(^/)>

But it's not working.

Don't forget http://stackoverflow.com/a/1732454/284111 – Andrew Savinykh Dec 30 '15 at 21:28

score 3 · Answer 1 · edited May 23 '17 at 11:44

You should not use regex to parse HTML, except as a learning exercise.

http://regexr.com/3cge1

<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*/?>

This regular expression matches well-formed opening tags, including tags with no attributes at all, self-closing tags, and attributes that have no value.

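<\w+ matches the opening < followed by one or more \w characters (the tag name).
(\s+[\w-]+(=(['"]?)[^"']*\3)?)* matches zero or more attributes, each of which must start with whitespace. It contains two parts:
- \s+[\w-]+ is the attribute name after the mandatory whitespace
- (=(['"]?)[^"']*\3)? is the optional attribute value; (['"]?) captures the opening quote (if any) as group 3, and \3 requires the same character as the closing quote
\s*/?> matches optional whitespace and an optional / followed by the closing >.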

Here is a test for the strings from the question. Each expression evaluates to true when the tag is incorrect:

var re = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

! '<div style="margin:37px;"/></div>'.match(re);
false

! '<span title=\'\'style="margin:37px;" /></span>'.match(re);
true

! '<span title="" style="margin:37px;" /></span>'.match(re);
false

! '<a title="u" hghghgh  title="j" >'.match(re);
false

! '<a title=""gg  ff>'.match(re);
true

Display all incorrect tags:

var html = '<div style="margin:37px;"></div> <span title=\'\'style="margin:37px;"/><a title=""gg ff/> <span title="" style="margin:37px;" /></span> <a title="u" hghghgh title="j"example> <a title=""gg ff>';
var tagRegex = /<\w+[^>]*\/?>/g;
var validRegex = /<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>/g;

html.match(tagRegex).forEach(function(m) {
  if(!m.match(validRegex)) {
    console.log('Incorrect', m);
  }
});

Will output

Incorrect <span title=''style="margin:37px;"/>
Incorrect <a title=""gg ff/>
Incorrect <a title="u" hghghgh title="j"example>
Incorrect <a title=""gg ff>
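
One caveat with the loop above: validRegex is not anchored, so a tag passes as soon as any substring of it looks like a valid tag. Below is a minimal sketch of an anchored variant. The helper name findInvalidTags and the sample string are made up for illustration and are not part of the original answer.

```javascript
// Hypothetical helper (not from the original answer): returns every opening
// tag that the anchored validation pattern rejects.
function findInvalidTags(html) {
  // Rough tag extractor, same pattern as tagRegex above.
  var tagRegex = /<\w+[^>]*\/?>/g;
  // Same validation pattern, but anchored with ^ and $ so the whole tag
  // has to be well-formed, not just some part of it.
  var validTag = /^<\w+(\s+[\w-]+(=(['"]?)[^"']*\3)?)*\s*\/?>$/;
  var tags = html.match(tagRegex) || []; // match() returns null when nothing is found
  return tags.filter(function (tag) {
    return !validTag.test(tag);
  });
}

var sample = '<span title="" style="margin:37px;" /></span> <a title=""gg ff> <a title=""<b x>';
console.log(findInvalidTags(sample));
// [ '<a title=""gg ff>', '<a title=""<b x>' ]
```

The last sample tag is the interesting one: without the anchors, the inner `<b x>` matches the validation pattern on its own, so the tag would not be reported.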

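Update in response to the comments (double-quoted, single-quoted, and unquoted attribute values are matched by separate alternatives):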

<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>
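
A small sketch of how the updated pattern behaves. Because each quote style gets its own alternative, a double-quoted value may contain a single quote, which the earlier backreference version rejects. The anchors and the helper name isValidTag are additions for this example only.

```javascript
// Updated pattern from the answer, anchored here (an addition for this sketch)
// so that the whole string has to be one well-formed opening tag.
var updated = /^<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*\/?>$/;

// Hypothetical helper name, used only in this example.
function isValidTag(tag) {
  return updated.test(tag);
}

console.log(isValidTag('<span title="" style="margin:37px;" />'));   // true
console.log(isValidTag("<span title=''style=\"margin:37px;\" />"));  // false
console.log(isValidTag('<a title="it\'s" class=btn>'));              // true
console.log(isValidTag('<a title=""gg  ff>'));                       // false
```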

answered Dec 30 '15 at 19:48 by sina · edited May 23 '17 at 11:44 by Community

It is working, but I want the opposite result (pick the incorrect ones). – wroe12 Dec 30 '15 at 20:48
You could pick the incorrect ones using `if(!string.match(re)) { ... }` – sina Dec 30 '15 at 21:10
I edited the tests to return `true` for the incorrect ones. – sina Dec 30 '15 at 21:17
Something is wrong. Try for this example: – wroe12 Dec 30 '15 at 21:27
So, what are you trying to do eventually with the incorrect matches? Are you trying to fix them (e.g. add a space where needed)? – sina Dec 30 '15 at 21:34
I want to edit my answer to do that, shall I do it in Javascript? – sina Dec 30 '15 at 21:42
I edited my answer for that. Basically we are iterating all tags and validating them using the main regex. – sina Dec 30 '15 at 21:53
Thanks, it works. Is this impossible with one regex, or just hard? – wroe12 Dec 30 '15 at 22:02
It is not impossible, but it will be unnecessarily complex and ugly. There is no real benefit in using one regex instead of two. – sina Dec 30 '15 at 22:17
I added an update at the end of my answer for your new test string. Please note that there are other cases where it might fail, e.g., a script containing Javascript that includes html in strings. – sina Dec 31 '15 at 18:26
Yes, I have problems with JS in HTML, and I am trying to fix them (with another regex), but I have new examples where it does not work, if you can look: http://paste.ofcode.org/LXikKQG3un68tnpwRbBMKJ – wroe12 Dec 31 '15 at 20:21
How about using a .NET library like suggested in http://stackoverflow.com/a/100393/721215? Doing it by regex is good for learning and for simple inputs (like the ones in your question), but it gets complicated quickly for more general input. – sina Dec 31 '15 at 21:33
Do you want this ` ` to be printed as incorrect? – sina Jan 01 '16 at 19:13
`<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[\w-]+)?)*\s*/?>` – sina Jan 03 '16 at 06:52
Did you test the last regex? – sina Jan 04 '16 at 20:44
Yes, it works, but not for everything. Exception: It should be correct. – wroe12 Jan 05 '16 at 01:37
`<\w+(\s+[\w-]+(="[^"]*"|='[^']*'|=[^\s'"]+)?)*\s*/?>` – sina Jan 05 '16 at 07:06
Hello, I found a new exception. It should be correct. – wroe12 Jan 16 '16 at 11:01
BTW, `placeholder="dfdfdfdf?` is missing the closing double-quote, it is invalid. – sina Jan 16 '16 at 14:25

score 1 · Answer 2 · edited Dec 30 '15 at 21:39

Try this regex; I think it will work:

<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>

< - opening bracket

\w* - zero or more word characters (the tag name)

[^=]*= - every character up to and including the first '='

["'][\w;:]*["'] - the attribute value: an opening single or double quote, zero or more word characters, ';' or ':', and a closing single or double quote

[\s/]+ - at least one whitespace character or '/'

[^>]*> - every character up to and including the closing '>'
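
To see what this pattern actually accepts, here is a small check. Going by the breakdown above, it matches tags whose first attribute value is followed by whitespace or a slash, so a match means "looks fine" and it never inspects later attributes. The sample strings other than the ones from the question are made up for illustration.

```javascript
// The pattern from this answer, unchanged.
var answer2 = /<\w*[^=]*=["'][\w;:]*["'][\s/]+[^>]*>/;

// Well-formed tag from the question: matches.
console.log(answer2.test('<span title="" style="margin:37px;" />'));   // true

// Missing space after the first attribute: no match, so it is flagged.
console.log(answer2.test("<span title=''style=\"margin:37px;\" />"));  // false

// Missing space after the SECOND attribute: still matches, because only the
// first attribute is checked.
console.log(answer2.test('<a title="u" alt=""x>'));                    // true

// Valid tag with '>' right after the closing quote: no match, because
// [\s/]+ demands at least one space or slash.
console.log(answer2.test('<div style="x">'));                          // false
```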

answered Dec 30 '15 at 19:40 by Khan · edited Dec 30 '15 at 21:39 by Mardzis

Please add an explanation of your answer. Code only answers are rarely useful to someone trying to learn. – RaGe Dec 30 '15 at 20:28
Not working, but I changed it a little and now it works: `<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+[^>]*(>)` – wroe12 Dec 30 '15 at 20:40
I found another example that did not work – wroe12 Dec 30 '15 at 20:46
@wroe12 yeah you have to actually , i think flags are not enabled in your case :) – Khan Dec 30 '15 at 20:49

Totem · Answer 3 · 2015-12-30T19:55:59.753

I got this pattern to work, finding incorrect lines 2 and 5 as you requested:

>>> import re
>>> p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

>>> html = """
 <div style="margin:37px;"/></div>
 <span title=''style="margin:37px;" /></span>
 <span title="" style="margin:37px;" /></span>
 <a title="u" hghghgh  title="j" >

 <a title=""gg  ff>
"""

>>> bad = re.findall(p, html)
>>> print('\n'.join(bad))
<span title=''style="margin:37px;" /></span>
<a title=""gg  ff>

regex broken down:

p = r'<[a-z]+\s[a-z]+=[\'\"][\w;:]*[\"\'][\w]+.*'

< - starting bracket

[a-z]+\s - 1 or more lowercase letters followed by a space

[a-z]+= - 1 or more lowercase letters followed by an equals sign

[\'\"] - match a single or double quote one time

[\w;:]* - match a word character (a-zA-Z0-9_), a colon, or a semicolon 0 or more times

[\"\'] - again match a single or double quote one time

[\w]+ - match a word character one or more times (this catches the lack of a space you wanted to detect)

.* - match anything 0 or more times(gets rest of the line)
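
As the breakdown shows, the pattern only looks at the first attribute and only allows lowercase letters in its name, so some bad tags slip through. Here is a rough JavaScript port that illustrates this. The port, the helper name findBadLines, and the last two sample lines are assumptions of this sketch, not part of the answer.

```javascript
// JavaScript port of the Python pattern above (the g flag plays the role of
// re.findall; '.' does not cross line breaks in either language by default).
var bad = /<[a-z]+\s[a-z]+=['"][\w;:]*["'][\w]+.*/g;

// Hypothetical helper: returns the matched lines, or an empty array.
function findBadLines(html) {
  return html.match(bad) || [];
}

var html = [
  '<span title=\'\'style="margin:37px;" /></span>', // bad, found
  '<a title="u" hghghgh  title="j" >',              // fine, not reported
  '<a href="#" title=""gg>',                        // bad, but missed: problem is in the 2nd attribute
  '<a data-id=""gg>'                                // bad, but missed: hyphen in the attribute name
].join('\n');

console.log(findBadLines(html));
// [ '<span title=\'\'style="margin:37px;" /></span>' ]
```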

This works for these examples but fails with this: – wroe12 Dec 30 '15 at 20:32

Mi-Creativity · Answer 4 · 2015-12-30T22:38:58.517

I am not sure about this, as I am not very experienced with regex, but this seems to work well:

JS Fiddle

<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))?

Currently <([a-z]+) will work for most tag names, but for web components and <ng-* elements, whose names contain hyphens, [\w-]+ would be a better choice.

---------------

Output:

<div style="margin:37px;">div</div> correct

<span title=" style="margin:37px;" />span1</span> incorrect

<span title="" style="margin:37px;" />span2</span> correct

<a title="u" title="j">link</a> correct

<a title=""href="" alt="" required>test</a> incorrect

<img src="" data-abc="" required> correct

<input type=""style="" /> incorrect
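
A minimal sketch of how such a listing could be produced. In a JavaScript regex literal the backreference to the tag name is written \1, so that form is used here. The ^ and $ anchors and the helper name classify are additions for this example. Note that the pattern only understands double-quoted attribute values.

```javascript
// Pattern from the answer, anchored for this sketch so that each sample
// string is judged as a whole. \1 refers back to the captured tag name.
var tagWithBody = /^<([a-z]+)(\s+[a-z\-]+(="[^"]*")?)*\s*\/?>([^<]+(<\/\1>))?$/;

// Hypothetical helper: labels a sample the way the output list above does.
function classify(sample) {
  return tagWithBody.test(sample) ? 'correct' : 'incorrect';
}

[
  '<div style="margin:37px;">div</div>',          // correct
  '<a title=""href="" alt="" required>test</a>',  // incorrect
  '<img src="" data-abc="" required>',            // correct
  '<input type=""style="" />'                     // incorrect
].forEach(function (sample) {
  console.log(sample, classify(sample));
});
```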
