Any regex to replace broken HTML attribute like this?

Question

I am using PHP and would like to write a function that automatically repairs broken HTML attributes such as

title="TV 40" is better"

with

title="TV 40&quot; is better"

So, my question is: how can I write a regex that finds the second double quote?

Regex is [not the right tool](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for parsing HTML. I don't even want to imagine what it would be like to parse invalid HTML :-) Fix your HTML at the first place. — Darin Dimitrov, Dec 08 '10 at 13:11
And how would you know if the string has two double quotes instead of one? What I mean is, if this was possible (in a general way) the browsers would have it and this wouldn't be a problem. — willvv, Dec 08 '10 at 13:13
@Darin — it isn't HTML. It is just broken tag soup. An HTML parser would just try to do the best job possible, which would probably involve discarding ` is better"`, which isn't desired. — Quentin, Dec 08 '10 at 13:23
@David, that's why I didn't propose to use an HTML parser but to fix the *tag soup* in the first place, so that you have, well, HTML :-) — Darin Dimitrov, Dec 08 '10 at 13:28

score 1 · Answer 1 · answered Dec 08 '10 at 13:14

You could use this instead of a regex:

// Escape the raw value before it is written into the attribute.
$value = "HTML CODE";
$escaped = htmlentities($value, ENT_QUOTES, 'UTF-8');

I hope this helps; correct me if I'm wrong.
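
As a rough illustration of the same idea outside PHP, here is a minimal Python sketch. Python's `html.escape` stands in for PHP's `htmlentities`, and the helper name `render_title_attr` is made up for this example. It assumes the raw value is still available before the markup is assembled; it cannot repair a string in which the broken attribute has already been written out.

```python
import html


def render_title_attr(raw_value: str) -> str:
    """Build a title attribute from a raw (unescaped) value."""
    # quote=True also escapes " and ', so the value cannot end the attribute early.
    return 'title="' + html.escape(raw_value, quote=True) + '"'


print(render_title_attr('TV 40" is better'))
# title="TV 40&quot; is better"
```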

Wesley

score -1 · Answer 2 · edited May 23 '17 at 12:04

I am somewhat confused about what you are trying to accomplish. Maybe a bigger example would be helpful.

Do you have an HTML document of your own with mistakes in it that you want to fix?
Or are you trying to write a program that will fix any broken HTML?

Some extra information on the context of your question could be helpful.

There are many cases you might be asking about, but in vim this works for me (for the example you provided):

:%s/"\(.*\)"\(.*\)"/"\1\&quot;\2"/g

It will change this:

title="TV 40" is better" title="TV 40" is better"

title="TV of 40 inch, spelled also as, 40" is better"

title="TV 40 is better"

To this:

title="TV 40" is better" title="TV 40&quot; is better"

title="TV of 40 inch, spelled also as, 40&quot; is better"

title="TV 40 is better"

However it will break something like this (that is already working):

title="TV 40 is better" title="TV 40 is better"
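
The behaviour above comes from the greedy `.*`: the first group swallows everything up to the second-to-last double quote on the line, so that quote is the one that gets escaped, whether or not it is the broken one. A small sketch of the same substitution using Python's `re` module (a stand-in for the vim command; the function name `escape_second_to_last_quote` is hypothetical) makes it easy to try other inputs:

```python
import re

# Same shape as the vim pattern: quote, greedy group, quote, greedy group, quote.
PATTERN = re.compile(r'"(.*)"(.*)"')


def escape_second_to_last_quote(line: str) -> str:
    """Replace the second-to-last double quote on the line with &quot;."""
    return PATTERN.sub(r'"\1&quot;\2"', line)


# Broken attribute: the stray quote is escaped as intended.
print(escape_second_to_last_quote('title="TV 40" is better"'))
# title="TV 40&quot; is better"

# Two valid attributes: the opening quote of the second one is corrupted.
print(escape_second_to_last_quote('title="TV 40 is better" title="TV 40 is better"'))
# title="TV 40 is better" title=&quot;TV 40 is better"

# Only two quotes on the line: the pattern needs three, so nothing changes.
print(escape_second_to_last_quote('title="TV 40 is better"'))
# title="TV 40 is better"
```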

As I mentioned before, some more context on what you are trying to solve would be helpful.

On a more general note, it is usually a bad idea to try to parse HTML with regex. Too many things can go wrong. Unless you know that the HTML is going to be in a certain format, I would not do it. HTML is not a regular language, so regular expressions cannot parse it in general. The only way around this is to know something special about the HTML, or to look only for very specific things in a page that is formatted in a predetermined way.
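
As one example of "knowing something special" about the input, here is a minimal Python sketch. It assumes that each string handed to the function is exactly one attribute fragment of the form `name="value"`, so the first double quote opens the value and the last one closes it. Both that assumption and the function name `escape_inner_quotes` are hypothetical and not part of the question.

```python
import re

# Assumption: the whole string is a single attribute, name="value".
# Group 1: name and opening quote; group 2: value; group 3: closing quote.
ATTR = re.compile(r'^(\s*[\w-]+\s*=\s*")(.*)("\s*)$')


def escape_inner_quotes(fragment: str) -> str:
    """Escape every double quote between the first and the last one."""
    match = ATTR.match(fragment)
    if match is None:
        # Not in the expected shape: leave it untouched rather than guess.
        return fragment
    head, value, tail = match.groups()
    return head + value.replace('"', '&quot;') + tail


print(escape_inner_quotes('title="TV 40" is better"'))
# title="TV 40&quot; is better"
```

If the assumption does not hold, the result is wrong: given `title="test" href="foo"` the sketch treats everything between the first and last quote as one value and escapes the two quotes in the middle. That is the general problem the comments on the question point out.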

According to Jeff Atwood, if you try to parse HTML with regex, "you're succumbing to the temptations of the dark god Cthulhu's … er … code". See this page.

This answer also gives some good examples of why it is a bad idea to parse HTML with regex.

Your answer will only work for the provided example. OP asked for a general solution. What if the text is `title="TV of 40 inch, spelled also as, 40" is better"`? — darioo, Dec 08 '10 at 13:15
It would work fine for that example too. It would fail horribly on `title="test" href="foo"` though as it would convert it to `title="test" href="foo"` — Quentin, Dec 08 '10 at 13:41
