grep false positives

Question

I have six HTML tags that I want to check for incorrect self-closing. The tags are: <input/><br/><hr/><img/><link/><meta/>. I'm also looking for it to not have a space before the end tag because the page is XHTML. Basically, I want to find the ones that DON'T self-close, or that do but have a space in front of the closing slash.

Right now I'm just focusing on one of the tags (input). It picks up some but not all. For instance, it picks up <input type='submit' value='Save'> which it is supposed to do. But it doesn't pick up <input type="text" name="name" id="name"/>. It also picks up correct self closing tags like <input type='submit' value='Save' /></td></tr>

My grep is:

grep "<input(.*[^/])>." *

Any ideas why?

@htor - that would give me all self closing tags. I want to find ones that DON'T have self closing or if it does there is a space in front of it. — user983223, Jan 12 '13 at 16:48

score 0 · Answer 1 · answered Jan 12 '13 at 16:39

Why should it pick up <input type="text" name="name" id="name"/>? That's a correctly closed tag.

Gereon


Forgot to mention it in the post, and I have corrected it above... I'm also looking for the tags that don't have a space before the end tag, to comply with XHTML. – user983223 Jan 12 '13 at 16:44

score 0 · Answer 2 · answered Jan 12 '13 at 17:54

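I think your current regexp isn't working because the .* is greedy and can match well past the end of the tag. Just eyeballing it, it looks like you're matching the opening string "<input", then as many characters as you can, with the final character being something other than a /, and then the closing >. That is also why <input type="text" name="name" id="name"/> is never reported: the character before its > is a /, which [^/] rejects.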

In the case of <input type='submit' value='Save' /></td></tr>, since it's greedy, it'll run all the way to the last > that works, which happens to be the > of the </td>. Your pattern finishes with a ., so the > has to be followed by one more character, and that rules out the final > of the </tr>.
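
To see the greedy match for yourself, here is a small sketch that prints only the matched text. The `-E` and `-o` flags and the variable names are my additions, not part of the question's command. I'm assuming extended syntax because in basic mode the unescaped parentheses would be taken literally.

```shell
# One sample line from the question, held in a variable for the demo.
line="<input type='submit' value='Save' /></td></tr>"

# -E makes the parentheses a group (assumption: the question meant extended syntax);
# -o prints only the matched text rather than the whole line.
greedy_match=$(printf '%s\n' "$line" | grep -E -o "<input(.*[^/])>.")

# The match runs through the > of </td> plus one more character.
printf '%s\n' "$greedy_match"
```

The printed text ends in `</td><`, which shows the match was never tied to the input tag's own closing >.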

As a bit of a hack-y replacement (I'm sure there's a more elegant way to do this..):

grep -P -o "<input.*?(?<=( .)|([^/]))>" test.html

(grep 2.6.3/cygwin if that's of relevance)

which roughly translates to: get me anything starting with "<input" and running (lazily) up to a ">", then look back and check that either the second-to-last character before the closing > is a space, or the last character isn't a closing slash.

if test.html has (for argument's sake):

<input type='submit' value='Save' /></td></tr>
<input type="text" name="name" id="name"/>
<input type='submit' value='Save'>
<a><input type="blah" /></a>
<input/>
<input></i>

the output is:

<input type='submit' value='Save' />
<input type='submit' value='Save'>
<input type="blah" />
<input>
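
If your grep has no -P, or you want all six tags from the question at once, something along these lines might do. This is only a sketch: the sample file, the extra <br/><hr> and <img> lines, and the variable names are made up for illustration. The pattern assumes no attribute value contains a literal >, and it does not stop a longer tag name that merely starts with one of the six from matching.

```shell
# Hypothetical sample file: the test lines above plus two extra lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<input type='submit' value='Save' /></td></tr>
<input type="text" name="name" id="name"/>
<input type='submit' value='Save'>
<a><input type="blah" /></a>
<input/>
<input></i>
<br/><hr>
<img src="a.png" alt="" />
EOF

# One of the six tag names, then optionally a run of non-'>' characters that
# ends in a non-slash character or in " /", then the closing '>'.
pattern='<(input|br|hr|img|link|meta)([^>]*([^/>]| /))?>'

# -o prints each offending tag on its own line.
flagged=$(grep -E -o "$pattern" "$sample")
printf '%s\n' "$flagged"
rm -f "$sample"
```

Using [^>]* instead of a lazy .*? keeps each match inside a single tag, so a line like <br/><hr> reports only the <hr>.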

More generally though, if you're looking to test for compliance with xhtml, would lxml make your life easier?

score 0 · Answer 3 · edited May 23 '17 at 12:04

Parsing HTML using Regexes is not advisable.

However if your HTML is formatted so that there's only one tag on each line, maybe you can get away with grep '<input' * | grep -v " />"
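
A quick way to check what that pipeline keeps, as a sketch only: the three-line sample file and the variable names are invented here, with one tag per line as assumed above.

```shell
# Hypothetical sample with one tag per line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<input type='submit' value='Save' />
<input type="text" name="name" id="name"/>
<input type='submit' value='Save'>
EOF

# The first grep keeps lines that mention <input;
# the second drops any of those lines that contain " />".
remaining=$(grep '<input' "$sample" | grep -v " />")
printf '%s\n' "$remaining"
rm -f "$sample"
```

This leaves the tag closed without a space and the tag that is not closed at all, so it treats " />" as the accepted form.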

answered Jan 12 '13 at 17:55

Gereon
