2

I have an HTML document saved in a .txt file, which looks like this:

<HTML> <HEAD>      <TITLE></TITLE> </HEAD> 
<BODY STYLE="font: 10pt Times New Roman, Times, Serif">  <P STYLE="margin: 0"></P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center">UNITED STATES</P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center">SECURITIES AND EXCHANGE COMMISSION</P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center">WASHINGTON, D.C. 20549</P>  
<P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center">&nbsp;</P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center"></P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center"><B>&nbsp;</B></P>   
<TABLE CELLSPACING="0" CELLPADDING="0" STYLE="font: 10pt Times New Roman, Times, Serif; width: 100%; border-collapse: collapse"> <TR STYLE="vertical-align: top">     <TD STYLE="width: 5%; padding-right: 5.4pt; padding-left: 5.4pt"><FONT STYLE="font-size: 10pt">[X]</FONT></TD>     <TD STYLE="width: 95%; padding-right: 5.4pt; padding-left: 5.4pt"><FONT STYLE="font-size: 10pt">ANNUAL REPORT UNDER SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</FONT></TD></TR> <TR STYLE="vertical-align: top">     
<TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD>     
<TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt">&nbsp;</TD></TR> <TR STYLE="vertical-align: top">     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD> 
<TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt; text-align: right"><FONT STYLE="font-size: 10pt">For the fiscal year ended <B><U>October 31, 2012</U></B></FONT></TD></TR> <TR STYLE="vertical-align: top">     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD>     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt">&nbsp;</TD></TR> <TR STYLE="vertical-align: top">     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"><FONT STYLE="font-size: 10pt">[ ]</FONT></TD>     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"><FONT STYLE="font-size: 10pt">TRANSITION REPORT UNDER SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</FONT></TD></TR> <TR STYLE="vertical-align: top">    
<TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD>     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt">&nbsp;</TD></TR> <TR STYLE="vertical-align: top">    
 <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD>     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt; text-align: right"><FONT STYLE="font-size: 10pt">For the transition period from _________ to ________</FONT></TD></TR>

I need plain text that preserves the line breaks shown on the rendered page. Instead, all of the text is being combined into a single line. How can I handle this? Below is my C# code:

    string text = File.ReadAllText(@"C:\a.txt",Encoding.UTF8);
    Regex regex = new Regex("<[^>]+>");
    text = regex.Replace(text, " ").Replace("(&#160;)+", Environment.NewLine).Replace("&#32;", "").Replace("&#8217;", "'").Replace("\r\n\r\n(\r\n)+", Environment.NewLine);
    text = HttpUtility.HtmlDecode(text);
    Console.WriteLine(text);
jgillich
newbieCSharp
  • We still don't know what you are trying to achieve. Can you be more specific? Also, you should read this: http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems – roundcrisis May 14 '14 at 15:01
  • @Miau, right now it prints UNITED STATES SECURITIES AND EXCHANGE COMMISSION on a single line. But when you look at the page, UNITED STATES is on the first line and SECURITIES AND EXCHANGE COMMISSION is on the next line. I want this to be preserved. – newbieCSharp May 14 '14 at 15:08
  • Placing paragraphs below each other is up to the HTML renderer, and it can easily be changed via CSS. So in your case you want each paragraph on a new line; 'preserving' is not the right word, because the line breaks are not in the file to begin with! – RvdK May 14 '14 at 15:11
  • @RvdK, how can this be done in C#? – newbieCSharp May 14 '14 at 15:13
  • @newbieCSharp - you could replace `` with a newline instead of a space. – Hans Kesting May 14 '14 at 17:40
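A minimal sketch of the idea in the comments above, assuming that the closing tags of `<P>`, `<TR>` and `<DIV>` (plus `<BR>`) are the ones that should end a line. The class name `HtmlToText` and the method `ToPlainText` are hypothetical, and `WebUtility.HtmlDecode` is swapped in for the question's `HttpUtility.HtmlDecode` so that no reference to System.Web is needed. Note that `string.Replace` treats its argument as literal text, so patterns such as `"(&#160;)+"` in the question's code never match; the sketch uses `Regex.Replace` throughout. It is still regex-based, so it will not cope with arbitrary HTML.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlToText
{
    // Assumption: these closing tags (and <br>) mark the end of a visual line.
    private static readonly Regex BlockEnd =
        new Regex(@"</(p|tr|div)\s*>|<br\s*/?>", RegexOptions.IgnoreCase);

    // Any remaining tag, as in the question's own pattern.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Runs of whitespace inside a single line (\s also matches U+00A0).
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        // Line breaks in the source file are not layout, so flatten them first.
        string text = html.Replace("\r", "").Replace("\n", " ");

        // Turn block-level closing tags into newlines, then strip all other tags.
        text = BlockEnd.Replace(text, "\n");
        text = AnyTag.Replace(text, " ");

        // Decode entities such as &nbsp; and &#8217;.
        text = WebUtility.HtmlDecode(text);

        // Collapse whitespace within each line and drop lines that end up empty.
        var lines = text.Split('\n')
            .Select(line => Whitespace.Replace(line, " ").Trim())
            .Where(line => line.Length > 0);

        return string.Join(Environment.NewLine, lines);
    }
}
```

Calling `HtmlToText.ToPlainText(File.ReadAllText(@"C:\a.txt", Encoding.UTF8))` on the sample should then put UNITED STATES and SECURITIES AND EXCHANGE COMMISSION on separate lines, with the cells of each table row joined on one line.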

2 Answers

1

I would never use regex to parse HTML. Instead, use HtmlAgilityPack; you can do a lot with simple XPath queries. For example:

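        // Requires the HtmlAgilityPack package and "using HtmlAgilityPack;".
        HtmlDocument doc = new HtmlDocument();
        doc.Load(@"C:\temp\stackoverflow\question23657841\question23657841\a.html");

        // Print the inner HTML of every <P> element, one per line.
        // Note: SelectNodes returns null when nothing matches the XPath.
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p"))
        {
            Console.WriteLine(node.InnerHtml);
        }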

The output is:

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
WASHINGTON, D.C. 20549
&nbsp;

<b>&nbsp;</b>

And by simply switching the XPath to //font, you get this:

[X]
ANNUAL REPORT UNDER SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended <b><u>October 31, 2012</u></b>
[ ]
TRANSITION REPORT UNDER SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to ________
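The output above still contains markup such as `<b>` and entities such as `&nbsp;`, because `InnerHtml` returns the raw contents of each node. A possible variation, sketched here under the assumption that each `<P>` and each `<TD>` should become one line of plain text, uses `InnerText` together with `HtmlEntity.DeEntitize`. The class name `HapText` and the method `ExtractLines` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HapText
{
    // Returns one plain-text entry per non-empty <p> or <td> element.
    public static List<string> ExtractLines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p | //td");
        if (nodes == null)
        {
            return lines;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops nested tags such as <b>; DeEntitize decodes
            // entities, so &nbsp; becomes U+00A0, which Trim() then removes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                lines.Add(text);
            }
        }

        return lines;
    }
}
```

Printing each entry with `Console.WriteLine` gives one line per paragraph or table cell. Unlike the `//font` query, this keeps the two cells of a table row on separate lines, which may or may not be the layout you want.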
Gustavo F
0

Why not read the file line by line? File.ReadAllLines() does just that.

Matas Vaitkevicius
  • He wants to strip all HTML tags, but keep some of the layout provided by that HTML, which is not in the line endings of the text file. – Hans Kesting May 14 '14 at 17:39
  • @HansKesting In what language "I need text which preserves Newline. All these text are getting combined into a single line. How to handle this?" does it mean I want to strip all HTML tags? – Matas Vaitkevicius May 14 '14 at 22:21
  • Not in the text itself, but in the code and comments: the OP strips tags with `Regex("<[^>]+>")` and apparently wants the `<P>` formatting ("newlines" according to him) preserved. – Hans Kesting May 15 '14 at 07:18