0

I have a web application that lets users upload HTML files generated by chess software, so that a page can include a JavaScript player that replays a chess game.

I do not want to load the uploaded files in a frame, so I reconstruct the HTML and JavaScript generated by the software by parsing the dynamic parts of the file.

The problem with the HTML is that all attribute values are surrounded by apostrophes instead of quotation marks. I am looking for a way to fix this using a library or a regex replace in C#.

The HTML looks like this:

<DIV class='pgb'><TABLE class='pgbb' CELLSPACING='0' CELLPADDING='0'><TR><TD>

and I would like to transform it into:

<DIV class="pgb"><TABLE class="pgbb" CELLSPACING="0" CELLPADDING="0"><TR><TD>
jhoefnagels

2 Answers

1

I'd say your best option is to use something like HTML Agility Pack to parse the generated HTML and then ask it to re-serialize the document to a string (hopefully correcting any formatting problems in the process). Any attempt to use regexes or other direct string manipulation on HTML is going to be difficult and fragile.


Example (when your HTML is stored in a file on the hard disk):

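// Requires the HtmlAgilityPack package: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");   // parse the generated file
doc.Save("file.htm");   // write the re-serialized HTML back to the same file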

It is also possible to do this directly in memory from a string or Stream of input HTML.
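A minimal in-memory sketch of that idea follows. It assumes the HtmlAgilityPack package is referenced and that the installed version exposes the `HtmlAttribute.QuoteType` property; the class name `QuoteNormalizer` and the method `UseDoubleQuotes` are hypothetical names chosen for illustration. Setting the quote type explicitly matters because single-quoted attributes are valid HTML, so a plain load and save may leave them unchanged.

```csharp
using System.IO;
using HtmlAgilityPack;

public static class QuoteNormalizer
{
    // Hypothetical helper: parses HTML from a string and re-serializes it,
    // asking the library to write every attribute value in double quotes.
    public static string UseDoubleQuotes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() walks every node below the document root.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            foreach (HtmlAttribute attribute in node.Attributes)
            {
                attribute.QuoteType = AttributeValueQuote.DoubleQuote;
            }
        }

        // Save to a writer so the document is actually re-serialized
        // instead of returning the original source text.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

Calling `QuoteNormalizer.UseDoubleQuotes(uploadedHtml)` would then replace the file round trip shown above. How the library handles a value that itself contains a quotation mark is worth checking against your own input before relying on this.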

Mark Pim
  • Hi Mark, I have tried this option but it leaves the single quotes, probably because they are valid HTML. It did fix the tag casing though. – jhoefnagels Nov 25 '11 at 12:40
0

You could use something like:

string outputString = Regex.Replace(inputString, @"(?<=<[^<>]*)'(?=[^<>]*>)", "\"");

I changed it after Oded's remark; this version only replaces apostrophes that sit between a < and a >, so it leaves the body HTML intact. But I agree, regex is a bad idea for parsing HTML. Mark's answer is better.
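As a rough, self-contained sketch of the same replacement, the pattern can be wrapped in a small helper. The class name `AttributeQuoteFixer` and the method `Fix` are hypothetical, and the sketch assumes no attribute value contains a `<`, a `>`, or an apostrophe of its own (for example inside an already double-quoted value); any of those would produce wrong output.

```csharp
using System.Text.RegularExpressions;

public static class AttributeQuoteFixer
{
    // An apostrophe counts as "inside a tag" when a '<' precedes it and a '>'
    // follows it with no other angle bracket in between. .NET allows the
    // variable-length lookbehind that this pattern relies on.
    private static readonly Regex ApostropheInsideTag =
        new Regex(@"(?<=<[^<>]*)'(?=[^<>]*>)", RegexOptions.Compiled);

    // Replaces every apostrophe found inside a tag with a quotation mark;
    // apostrophes in text between tags are left alone.
    public static string Fix(string html)
    {
        return ApostropheInsideTag.Replace(html, "\"");
    }
}
```

For example, `AttributeQuoteFixer.Fix("<DIV class='pgb'><TD>White's move</TD></DIV>")` returns `<DIV class="pgb"><TD>White's move</TD></DIV>`: the attribute delimiters change, while the apostrophe in the text stays.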

TomL
  • Which will replace apostrophes in the body of the HTML as well as attribute delimiters. Not a good solution here. – Oded Nov 24 '11 at 09:56
  • Well... in that case : (iString, @"(?<=\<[^\<\>]*)\'(?=[^\<>]*\>)", "\"") – TomL Nov 24 '11 at 10:20
  • Oded, I agree. But I was under the impression I was just answering the question, which is a simple change of ' to ". I wasn't implying regex is any good for parsing HTML. Oh well... – TomL Nov 24 '11 at 10:29