
I have an HTML file from a website, and I use regular expressions to search it for certain values and write them to a document. This is the text:

<div class="scrollable " style="height: 200px;">
        <div>
            <p>CO-Schrank: nicht ben&ouml;tigtes ausbauen</p>
<p><strong>________________________________________________________________________</strong></p>

<p><strong>==&gt;&nbsp; wird nicht mehr ben&ouml;tigt!<br /></strong>z-B.: IUC</p>

<p>CO-Management in Gen. 2 implementieren</p>

<ol>
<li>Ausbau der PCI-Karten aus ZKA-PC in CO-PC- PC-Sys 02 TP 55, 56, 61 sind noch Profibus im ZKA-PC ==&gt; in CO-PC- PC-Sys 02 greift dann auf CO-PC f&uuml;r Datenaufzeichnung =&gt; Betrieb wieder aufnehmen</li>

<li>Ausbau der IUC</li>

<li>Testaufbau am CO-PC f&uuml;r den CO-Algorithmus und Datenspeicherung</li>

<li>Gen. 2 in CO-Management implementieren- pro Pr&uuml;fling 3 Min. (3 Min. x 48 HG x 10 Messungen)&nbsp;= 1440 Min. = 24 h- Messzeit 1-2 Min.</li>

</ol>


</div></div>

Now I also want all the text inside the <div> ... </div>. I wrote this code, but it is not working:

Match description = Regex.Match(line, "^<div class=\"scrollable \"^(.*?)$div>", 
    RegexOptions.Multiline);//multiple line

if (description.Success)
{
    //Console.WriteLine(status_id.Groups[1].Value);
    System.IO.StreamWriter file = new System.IO.StreamWriter(@"C:\\Webasto\\csv-"+zahl+".txt");
    file.WriteLine(id.Groups[1].Value + ";4;4;" + subject.Groups[1].Value + ";" + due_date.Groups[1].Value+";NULL;"+status_id.Groups[1].Value+";"//+assigned.Groups[1].Value
        +";"
        +priority.Groups[1].Value+";NULL;"+autor.Groups[1].Value+";0;"+created_on.Groups[1].Value+";"+start_date.Groups[1].Value+";"+done_ratio.Groups[1].Value+";"+hours.Groups[1].Value
        +";NULL;"+id.Groups[1].Value+";1;2;0;"+closed.Groups[1].Value+";");
    file.Close();
}
Uwe Keim
Hans Sroeb

2 Answers


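You have misunderstood what RegexOptions.Multiline means (I don't blame you; I have to think twice every time I use regex). Multiline only changes ^ and $ so that they match at the start and end of every line (ended with \n) instead of only at the start and end of the whole string.

You need RegexOptions.Singleline, which makes . match every character including \n, so the pattern can span lines as if the whole string were one line.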
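
To make the difference concrete, here is a minimal sketch. The class name, the method names, and the short four-line input are made up for illustration and are not taken from the question's code; only Regex and RegexOptions come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Hypothetical four-line input standing in for the HTML in the question.
    public const string Input = "<div>\nfirst line\nsecond line\n</div>";

    // Multiline only changes the anchors: ^ now matches at the start of every line.
    public static int CountLineStarts() =>
        Regex.Matches(Input, "^", RegexOptions.Multiline).Count;

    // Without Singleline, '.' stops at '\n', so the pattern cannot span lines.
    public static bool MatchesWithoutSingleline() =>
        Regex.IsMatch(Input, "<div>(.*?)</div>");

    // With Singleline, '.' also matches '\n', so group 1 captures all inner lines.
    public static string CaptureWithSingleline() =>
        Regex.Match(Input, "<div>(.*?)</div>", RegexOptions.Singleline).Groups[1].Value;
}
```

With this input, CountLineStarts() finds one line start per line, MatchesWithoutSingleline() finds no match, and CaptureWithSingleline() returns everything between the two tags, newlines included. The two options are independent and can be combined with RegexOptions.Multiline | RegexOptions.Singleline if both behaviours are needed.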

Side note: it is a bad idea to use Regex to parse HTML. Use a decent HTML parser instead.

Patrick Hofman
  • True. I always thought singleline was a horrible name for what actually means "dot matches all". Especially since you can have singleline and multiline mode active at the same time. – timgeb Dec 29 '15 at 14:01
  • Yes, horrible naming convention. – Patrick Hofman Dec 29 '15 at 14:02

It's well known that you should use an HTML parser instead of regex.

Anyway, you can use a regex if you know which character set your HTML uses. If you still want to use a regex, you can use one with the single-line flag, like this:

(?s)<div>.*?<\/div>


Or using a regex trick:

<div>[\s\S]*?<\/div>
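
As a rough sketch of how either pattern could be used from C#: the class and method names below are hypothetical, the HTML is a shortened stand-in for the question's text, and a capture group has been added so that only the inner content is returned. The slash is left unescaped because / has no special meaning in .NET patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivExtractor
{
    // Shortened, hypothetical stand-in for the HTML in the question.
    public const string Html =
        "<div class=\"scrollable \" style=\"height: 200px;\">\n" +
        "        <div>\n" +
        "            <p>CO-Schrank: nicht ben&ouml;tigtes ausbauen</p>\n" +
        "<ol>\n<li>Ausbau der IUC</li>\n</ol>\n" +
        "</div></div>";

    // Inline (?s) flag: '.' matches newlines too.
    public static string WithInlineFlag() =>
        Regex.Match(Html, @"(?s)<div>(.*?)</div>").Groups[1].Value;

    // [\s\S] matches any character at all, so no flag is needed.
    public static string WithCharacterClass() =>
        Regex.Match(Html, @"<div>([\s\S]*?)</div>").Groups[1].Value;
}
```

Both methods return the same text here. Note that the lazy quantifier stops at the first </div> it finds, so this only works as intended when the matched <div> contains no nested <div> elements; nested markup is one of the cases where an HTML parser is the safer choice.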
Federico Piazza