
I have six HTML tags that I want to check for incorrect self-closing: <input/>, <br/>, <hr/>, <img/>, <link/>, <meta/>. Because the page is XHTML, I'm also looking for tags that don't have a space before the closing />. Basically, I want to find the tags that DON'T self-close or, if they do, check for the space in front of the slash.

Right now I'm just focusing on one of the tags (input). It picks up some but not all. For instance, it picks up <input type='submit' value='Save'> which it is supposed to do. But it doesn't pick up <input type="text" name="name" id="name"/>. It also picks up correct self closing tags like <input type='submit' value='Save' /></td></tr>

My grep is:

grep "<input(.*[^/])>." *

Any ideas why?

user983223
  • What about `grep -E "" file.html`? – htor Jan 12 '13 at 16:39
  • @htor - that would give me all self closing tags. I want to find ones that DON'T have self closing or if it does there is a space in front of it. – user983223 Jan 12 '13 at 16:48

3 Answers


Why should it pick up <input type="text" name="name" id="name"/>? That's a correctly closed tag.

Gereon
  • Forgot to mention this in the post, and I have corrected it above: I'm also looking for the tags that don't have a space before the end tag, to comply with XHTML. – user983223 Jan 12 '13 at 16:44

I think your current regexp isn't working because it matches as much of the line as it can. Just eyeballing it, it looks like you're matching the opening string "<input", then as many characters as you can, with the final character being something other than a /, then the closing >, followed by one more character (the trailing .).

In the case of <input type='submit' value='Save' /></td></tr>, since it's greedy, it'll run all the way to the last > that works, which happens to be the > of the </td> (since your pattern finishes with a ., the final > of </tr> can't be used).

As a bit of a hack-y replacement (I'm sure there's a more elegant way to do this..):

grep -P -o "<input.*?(?<=( .)|([^/]))>" test.html

(grep 2.6.3/cygwin if that's of relevance)

which roughly translates to: get me anything starting with "<input" and ending with ">" (lazily), then look back and check that either the second-to-last character before the closing > is a space, or the last character isn't a slash.

if test.html has (for argument's sake):

<input type='submit' value='Save' /></td></tr>
<input type="text" name="name" id="name"/>
<input type='submit' value='Save'>
<a><input type="blah" /></a>
<input/>
<input></i>

the output is:

<input type='submit' value='Save' />
<input type='submit' value='Save'>
<input type="blah" />
<input>
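If grep -P isn't available, or you want all six tags at once, something like the following might work. It is a sketch that swaps the lookbehind for a plain extended regex (-E) rather than reusing the pattern above, and it assumes no attribute value contains a literal >. The function name find_flagged is made up for illustration.

```shell
# Print every tag from the six-tag list that is either not self-closed,
# or self-closed with a space before the slash (same cases as the -P pattern).
# Assumption: attribute values never contain a literal '>'.
# Note: the tag name is not anchored, so e.g. <inputx> would also be matched.
find_flagged() {
  grep -oE '<(input|br|hr|img|link|meta)([^>]*( /|[^/]))?>' "$@"
}

# Usage: find_flagged test.html
```

Run against the test.html above, it should print the same four tags as the -P version, since [^>]* can never cross the end of a tag the way .* can.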

More generally though, if you're looking to test for compliance with xhtml, would lxml make your life easier?

tanantish

Parsing HTML using Regexes is not advisable.

However if your HTML is formatted so that there's only one tag on each line, maybe you can get away with grep '<input' * | grep -v " />"
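The same two-step filter could be stretched to cover all six tags by using an alternation in the first grep. This is only a sketch under the same one-tag-per-line assumption; find_unspaced is a hypothetical name.

```shell
# List lines that contain one of the six tags but no " />" anywhere on the line.
# Assumption: at most one tag per line, as in the pipeline above.
find_unspaced() {
  grep -E '<(input|br|hr|img|link|meta)' "$@" | grep -v ' />'
}

# Usage: find_unspaced *.html
```

Because the second grep discards the whole line, a line holding both a correct and an incorrect tag would be missed, which is why the one-tag-per-line assumption matters.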

Gereon