Extracting data from an HTML file using a C# script

Question

What I need to do: Extract the From, To, Cc and Subject information from an HTML file and then remove it from the file, without using any third-party library (HtmlAgilityPack, etc.).

What I am having trouble with: How should I approach getting these fields (From, To, Cc, Subject) out of the HTML tags?

Steps I tried: I looked up the index of <p class=MsoNormal> and the last index of the email domain @sampleemail.com, but I think that is a bad approach because some HTML files contain many <p class=MsoNormal> tags. To remove From, To, Cc and Subject I used string.Remove(startIndex, count), where count is the number of characters between the IndexOf and LastIndexOf positions, and that worked.

Sample tag containing information of from:

<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>
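For illustration, the index-based attempt described above might look like the following sketch. It anchors on the label text (for example "From:") rather than on <p class=MsoNormal>, since that class appears on many paragraphs. The class and method names (HeaderIndexSketch, FindParagraph, RemoveParagraph) are hypothetical, and the code assumes that the label occurs first inside its header paragraph and that the paragraph opens with "<p " followed by attributes.

```csharp
using System;

public static class HeaderIndexSketch
{
    // Returns the start index and length of the <p ...>...</p> block that
    // contains the first occurrence of the label, or (-1, 0) if none is found.
    public static (int Start, int Length) FindParagraph(string html, string label)
    {
        int labelPos = html.IndexOf(label, StringComparison.OrdinalIgnoreCase);
        if (labelPos < 0) return (-1, 0);

        // Search backwards from the label for the opening tag of its paragraph.
        // Assumption: the tag is written as "<p " followed by attributes.
        int start = html.LastIndexOf("<p ", labelPos, StringComparison.OrdinalIgnoreCase);

        // Search forwards from the label for the closing tag.
        int end = html.IndexOf("</p>", labelPos, StringComparison.OrdinalIgnoreCase);
        if (start < 0 || end < 0) return (-1, 0);

        return (start, end + "</p>".Length - start);
    }

    // Removes the paragraph that holds the label; returns the input unchanged
    // when the label or its surrounding tags cannot be located.
    public static string RemoveParagraph(string html, string label)
    {
        var (start, length) = FindParagraph(html, label);
        return start < 0 ? html : html.Remove(start, length);
    }
}
```

Calling HeaderIndexSketch.RemoveParagraph(html, "From:") once per label would strip each header paragraph in turn. The weak point is the same as in any index-based search: a label such as "To:" that also appears earlier in the body text would be picked up first.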

Tấn Nguyên · Accepted Answer · 2020-05-05T08:46:17.060

HtmlAgilityPack is your friend. An XPath query such as //p[@class='MsoNormal'] selects the matching tags, and InnerText returns their content.

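using System;
using HtmlAgilityPack; // third-party NuGet package

public static class Program
{
    public static void Main()
    {
        var html =
            @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
        if (nodes == null) return;

        // InnerText concatenates the text of all descendants, dropping the tags.
        foreach (var node in nodes)
            Console.WriteLine(node.InnerText);
    }
}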

Result:

From:1234@sampleemail.com

Update

We can use Regex to write a simple parser. Keep in mind that it cannot handle every case in a complicated HTML document.

using System;
using System.Text.RegularExpressions;

public static void MainFunc()
{
    string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>";
    // Strip every tag; quoted attribute values are allowed to contain '>'.
    var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
    Console.WriteLine(result);
}
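The tag-stripping idea above can be pushed a little further to cover both halves of the question (extract the fields, then remove them) with nothing beyond the built-in System.Text.RegularExpressions. The following is only a sketch under stated assumptions: the names MailHeaderSketch and Extract are hypothetical, each header is assumed to sit in its own <p> element with the label (From, To, Cc or Subject) as the first visible text, and header paragraphs are assumed not to contain nested <p> tags.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class MailHeaderSketch
{
    // One <p ...>...</p> whose first visible text is a header label.
    // Between the opening <p> and the label only tags and whitespace may occur.
    private static readonly Regex HeaderParagraph = new Regex(
        @"<p\b[^>]*>(?:\s|<[^>]+>)*(?<name>From|To|Cc|Subject):(?<rest>.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Simplified tag matcher used to clean the captured value.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Returns the header values and, via 'remaining', the HTML with the
    // header paragraphs removed. If a label occurs twice, the last one wins.
    public static Dictionary<string, string> Extract(string html, out string remaining)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in HeaderParagraph.Matches(html))
        {
            string value = AnyTag.Replace(m.Groups["rest"].Value, "");
            // Decode entities such as &lt; that mail clients put around addresses.
            headers[m.Groups["name"].Value] = WebUtility.HtmlDecode(value).Trim();
        }
        remaining = HeaderParagraph.Replace(html, "");
        return headers;
    }
}
```

A call such as MailHeaderSketch.Extract(html, out string body) would then give the fields in a dictionary and the cleaned HTML in body. The same caveat applies as for any regex-based parser: a body paragraph that happens to start with "To:" would be treated as a header, so the pattern should be checked against real files before relying on it.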

edited May 05 '20 at 08:46

answered May 05 '20 at 05:20

Tấn Nguyên


I forgot to mention in my question that one of my requirements is not to rely on third-party libraries such as HtmlAgilityPack. Is this possible? – keinz May 05 '20 at 05:29

@lonewolfkein HtmlAgilityPack builds on the XPath support in System.Xml.XPath (see [System.Xml.XPath.XPathDocument](https://learn.microsoft.com/en-us/dotnet/standard/data/xml/reading-xml-data-using-xpathdocument-and-xmldocument)). You have three choices: HtmlAgilityPack for simplicity, more code with System.Xml, or complicated code with your own parser. – Tấn Nguyên May 05 '20 at 05:54
Please give more detail about the HTML; parts of it may be hard-coded and predictable. – Tấn Nguyên May 05 '20 at 05:56
As of now, they forbid third-party libraries such as HtmlAgilityPack, or anything else that requires a download. I have looked into System.Xml, but it is too complicated for my coding skill level. Can you tell me why you need more detail about the HTML? It is too long, and the part I posted is what I really need. By the way, I really appreciate the help, Mr. Tan Nguyen. – keinz May 05 '20 at 07:45

@lonewolfkein Yes, it seems we cannot use System.Xml, because it parses the input as XML and rejects HTML that is not well-formed before XPath can be applied. The more specific the HTML you post, the more cases we can test with `Regex` when writing a simple parser. If you can isolate that line, I suggest a simple Regex pattern. I have updated my answer. – Tấn Nguyên May 05 '20 at 08:43

It seems I can't use it because mine is a complicated HTML document. Thanks for the help, Tan. – keinz May 05 '20 at 09:25
@lonewolfkein It depends on your HTML structure. You could use an [online Regex tester](https://regexr.com) to check whether the pattern suits your HTML. – Tấn Nguyên May 05 '20 at 09:32

Again, thank you for all this information; I'm new to this. I'm really grateful and will check it out. – keinz May 05 '20 at 09:47
