C# parse XML from HTML Body and save to file

Question

In C#, after I perform a GET against the API, it returns XML embedded in an HTML page, similar to this:

<!DOCTYPE html>

<html lang="en">
    <head>
        <meta name="viewport" content="initial-scale=1, width=device-width">
        <title>config</title>
    </head>
    <body>
        
<CONFIG="2"/>
<VALUE1="1"/>
<VALUE2="2"/>
<CONFIGEND="0"/>

    </body>
</html>

I am trying to save the XML content from the body "<CONFIG ... CONFIGEND="0"/>" out to a file. My attempts using HtmlAgilityPack result in the XML data being modified as follows:

<CONFIG="2"></CONFIG>
...
<CONFIGEND="0"></CONFIGEND>

I am new to C# (and programming in general) so please be kind. Search attempts have left me more confused than I started :/

Hi! Welcome to SO! Which API are you using, and why does it put XML inside HTML? – Evgeny Gorb, Feb 20 '21 at 23:01

score 2 · Accepted Answer · answered Feb 20 '21 at 21:39

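Yes, as you figured out, HtmlAgilityPack rewrites the markup: it expands self-closing tags such as <CONFIGEND="0"/> into an open/close pair. HTML looks like XML, but this page is not well-formed XML (the meta tag is never closed, and <CONFIG="2"/> is not a valid XML element), so System.Xml.XmlDocument cannot load it either. You therefore need to extract the body text manually.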
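To illustrate why System.Xml rejects this content, here is a minimal sketch. The class and method names (XmlLoadCheck, IsWellFormedXml) and the root wrapper element are made up for the example; the first sample line is taken from the question's body.

```csharp
using System;
using System.Xml;

class XmlLoadCheck
{
    // Returns true if the text loads as XML, false if the parser rejects it.
    public static bool IsWellFormedXml(string text)
    {
        try
        {
            new XmlDocument().LoadXml(text);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // A line from the question's body; wrapping it in a root element does not help,
        // because an element name cannot be followed directly by '='.
        Console.WriteLine(IsWellFormedXml("<root><CONFIG=\"2\"/></root>"));       // False

        // A conventional element with a named attribute, for comparison.
        Console.WriteLine(IsWellFormedXml("<root><CONFIG value=\"2\"/></root>")); // True
    }
}
```

Since the parser rejects each line of the payload, treating the body as plain text is the practical option here.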

As Anis R. says, the simplest way is a regular expression. To use the Regex class, add using System.Text.RegularExpressions; at the top of your file.

Let's say your HTML content is in a variable named htmlstring.

First, define a pattern that captures everything between the body tags:

string regexPattern = @"<body>(.*?)</body>";
Regex regex = new Regex(regexPattern, RegexOptions.Singleline);

The RegexOptions.Singleline option is required because the HTML contains newline characters: by default . does not match \n, and Singleline makes it match every character.
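The effect of the option can be checked with a small sketch like the one below. The class name, the helper BodyMatches, and the shortened sample string are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

class SinglelineDemo
{
    // Reports whether the body pattern matches the given HTML under the given options.
    public static bool BodyMatches(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, @"<body>(.*?)</body>", options);
    }

    static void Main()
    {
        string html = "<body>\n<CONFIG=\"2\"/>\n</body>";

        // Default options: '.' stops at '\n', so the pattern cannot reach </body>.
        Console.WriteLine(BodyMatches(html, RegexOptions.None));       // False

        // Singleline: '.' also matches '\n', so the whole body is captured.
        Console.WriteLine(BodyMatches(html, RegexOptions.Singleline)); // True
    }
}
```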

string body = regex.Match(htmlstring).Value;

This gives you the whole match, including the body tags:

<body>
        
<CONFIG="2"/>
<VALUE1="1"/>
<VALUE2="2"/>
<CONFIGEND="0"/>

    </body>

To remove the body tags:

string result = body.Replace("<body>", "").Replace("</body>", "");

To trim leading and trailing whitespace:

string prettierResult = result.Trim();

Now you have:

<CONFIG="2"/>
<VALUE1="1"/>
<VALUE2="2"/>
<CONFIGEND="0"/>

To save the content to a file (File lives in System.IO, so add using System.IO; as well):

File.WriteAllText("c:\\path-to-save", prettierResult);
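Putting the steps together, a minimal end-to-end sketch might look like this. It reads the capture group directly instead of stripping the tags afterwards, which gives the same text. The class name, the ExtractBody helper, the sample response, and the temp-folder output path are all assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

class BodyExtractor
{
    // Returns the trimmed text between <body> and </body>, or null when no body is found.
    public static string ExtractBody(string html)
    {
        Match match = Regex.Match(html, @"<body>(.*?)</body>", RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value.Trim() : null;
    }

    static void Main()
    {
        // Stand-in for the API response; in real code this would come from the GET request.
        string htmlstring = "<html><body>\n<CONFIG=\"2\"/>\n<CONFIGEND=\"0\"/>\n\n    </body></html>";

        string xml = ExtractBody(htmlstring);
        if (xml == null)
        {
            Console.WriteLine("No <body> element found.");
            return;
        }

        // Hypothetical output location; replace with the real destination.
        string path = Path.Combine(Path.GetTempPath(), "config.xml");
        File.WriteAllText(path, xml);
        Console.WriteLine(File.ReadAllText(path));
    }
}
```

Checking match.Success before writing avoids saving an empty file when the response has no body element.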

Anis R. · Answer 2 · 2021-02-20T22:10:44.490

score 0

If the format is consistent¹ (e.g., you always want everything between <body>...</body>), then one way is to use a regex:

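string pattern = @"<body>(.*?)</body>";
// Singleline lets '.' match line breaks, so a multi-line body is captured too
Regex rg = new Regex(pattern, RegexOptions.Singleline);

string html = "<body>Content here</body>";

// get the first match and print its captured group
Match firstMatch = rg.Match(html);
if (firstMatch.Success)
    Console.WriteLine(firstMatch.Groups[1].Value); // Content here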

(PS: this will need using System.Text.RegularExpressions;)

¹ Keeping in mind that regular expressions are generally not a reliable way to parse HTML.
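If the format turns out to be slightly less consistent (attributes on the body tag, or different letter case), the pattern can be loosened a little. The sketch below is one assumed variant, not something from the original API; the class name and sample input are illustrative, and it is still a regex rather than a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

class TolerantBodyPattern
{
    // Allows attributes on <body> and any letter case in the tag names.
    static readonly Regex BodyRegex = new Regex(
        @"<body\b[^>]*>(.*?)</body\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the trimmed body content, or null when no body element is found.
    public static string Extract(string html)
    {
        Match m = BodyRegex.Match(html);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    static void Main()
    {
        Console.WriteLine(Extract("<BODY class=\"x\">\nContent here\n</BODY>")); // Content here
    }
}
```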

edited Feb 20 '21 at 22:10

answered Feb 20 '21 at 20:49

Anis R.


Thank you for this @Anis R., it got me on the correct track. I had to modify yours slightly, referencing https://www.brightfunction.co.uk/remove-html-with-regular-expressions/ as well. – figment512 Feb 20 '21 at 22:05
"If the format is consistent" - *if* being the operative word here. Regular expressions [should not](https://stackoverflow.com/a/6751339/4317297) be used to parse HTML. While it may work in this particular case, in general it is a bad idea. – Riwen Feb 20 '21 at 22:05
@Riwen Absolutely! I guess I should have better emphasized it - Answer edited. However, when you only want to handle a very specific case (like here), it can be okay. – Anis R. Feb 20 '21 at 22:08
@Riwen I tried several other methods and always ended up with modified XML data. I understand RegEx should not be used in general - this is a very specific case where the formatting won't ever deviate. If you have a solution I welcome it! – figment512 Feb 20 '21 at 22:10
