Extract string by regular expressions in C#

Question

I have some HTML files with markup like this:

 <div style="border: 0px red solid; width: 633px; position: relative; margin: 0px;
                                                                float: right">
                                                                <font style="font-size: 8pt; color: Navy; font-weight: Bold;">Unit Name: </font>My Unit Name&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font style="font-size: 8pt; color: Navy; font-weight: Bold;">
                                                                    Manager: </font>My Manager Name&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font style="font-size: 8pt;
                                                                        color: Navy; font-weight: Bold;">Category: </font>My Category
                                                            </div>
                                                            <div style="border: 0px red solid; width: 122px; position: relative; margin: 0px;
                                                                padding: 0px;">
                                                                <button name="sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de" style="font-family: Tahoma;"
                                                                    onclick="OpenMyWin2(1,843442,8445,'bf61fd588f00cbe7a37dab20c62e1c63')">
                                                                    More Info</button></div>

I want to extract the text that follows Category:, Manager:, and Unit Name:. How can I use a regular expression to extract those values from a large HTML file? Each file may contain around 100 similar items.

The best way to handle this kind of task is through a dedicated library like [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) – Steve, Sep 26 '12 at 13:40
Parsing HTML with regex is a no-no. For a laugh, read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – dario_ramos, Sep 26 '12 at 13:40

score 0 · Answer 1 · answered Sep 26 '12 at 13:39

I recommend using this tool: http://htmlagilitypack.codeplex.com/

It lets you easily parse any HTML you want.

answered Sep 26 '12 at 13:39 by berliner

Ωmega · Answer 2 · 2012-09-26T14:05:45.890

0

It is a bad idea to use regex to parse HTML; however, if you want to use regex anyway, use this pattern:

>\s*Unit Name:[^>]*>([^<]+).*?>\s*Manager:[^>]*>([^<]+).*?>\s*Category:[^>]*>([^<]+)

which can be reduced to

>\s*(?:Unit Name|Manager|Category):[^>]*>([^<]+)

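To trim the trailing &nbsp; entities, replace ([^<]+) in the regex pattern with ([^<&]+). Using (\w+) also drops them, but it captures only the first word of each value.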
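For what it's worth, here is a minimal sketch of how the reduced pattern could be applied from C#. The class name `FieldExtractor`, the method `Extract`, and the named groups `label` and `value` are my own additions for illustration and are not part of the pattern above. It keeps the original ([^<]+) capture and cleans up the &nbsp; entities and surrounding whitespace afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldExtractor
{
    // Reduced pattern from the answer; the named groups are an illustrative
    // addition so each match reports which label it belongs to.
    private static readonly Regex FieldPattern = new Regex(
        @">\s*(?<label>Unit Name|Manager|Category):[^>]*>(?<value>[^<]+)");

    // Returns (label, value) pairs in the order they appear in the HTML.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match m in FieldPattern.Matches(html))
        {
            // The capture runs up to the next '<', so it can still contain
            // &nbsp; entities, line breaks, and indentation.
            string value = m.Groups["value"].Value.Replace("&nbsp;", " ").Trim();
            fields.Add(new KeyValuePair<string, string>(m.Groups["label"].Value, value));
        }
        return fields;
    }
}
```

Calling `FieldExtractor.Extract(html)` on a page with 100 items should simply return 300 pairs in document order, so grouping every three consecutive pairs gives one item. The `Trim()` matters for Category in particular, because in the posted markup its value runs on to the line break before `</div>`.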

edited Sep 26 '12 at 14:05

answered Sep 26 '12 at 13:44


I have a HTML string that repeat my pattern for 50 times in every page. I use IndexOf with an index. I solved it. – Ehsan Sadeghi Sep 29 '12 at 06:25
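The comment does not show the code, so the following is only a guess at what an `IndexOf`-based loop might look like. The class `IndexOfExtractor`, the method `ExtractAfter`, and the assumption that every value sits between a `</font>` tag and the next `<` or `&` are mine, based on the markup in the question.

```csharp
using System;
using System.Collections.Generic;

public static class IndexOfExtractor
{
    // Collects every value that follows the given label (e.g. "Manager:").
    // Assumes the label is non-empty and is followed by "</font>" as in the question.
    public static List<string> ExtractAfter(string html, string label)
    {
        const string closeTag = "</font>";
        var values = new List<string>();
        int pos = 0;
        while (true)
        {
            int labelAt = html.IndexOf(label, pos, StringComparison.Ordinal);
            if (labelAt < 0) break;

            int closeAt = html.IndexOf(closeTag, labelAt + label.Length, StringComparison.Ordinal);
            if (closeAt < 0) break;

            // The value starts after </font> and ends at the next tag or entity.
            int start = closeAt + closeTag.Length;
            int end = html.IndexOfAny(new[] { '<', '&' }, start);
            if (end < 0) end = html.Length;

            values.Add(html.Substring(start, end - start).Trim());
            pos = end; // continue searching after this value
        }
        return values;
    }
}
```

Calling it once per label (`"Unit Name:"`, `"Manager:"`, `"Category:"`) gives three parallel lists. It avoids regex entirely, but it is just as sensitive to changes in the markup.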

score 0 · Answer 3 · answered Sep 26 '12 at 13:49

Maybe this can help you. It uses zero-width lookahead and lookbehind assertions.

 (?<=(Category:|Manager:|Unit Name:) (</font>)?).*?(?=(&|<))
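In case it helps, this is a rough sketch of running that pattern from C#. The wrapper class `LookaroundExtractor` and its `Extract` method are hypothetical names. Two things I had to assume to make it work on the markup in the question: `RegexOptions.Singleline`, because the Category value is followed by a line break before the next `<`, and a filter for empty matches, because the optional `(</font>)?` also lets the lookbehind succeed right before `</font>`, where the lazy `.*?` matches nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundExtractor
{
    // Pattern from the answer, unchanged. .NET allows the variable-length
    // lookbehind; many other regex engines do not.
    private static readonly Regex ValuePattern = new Regex(
        @"(?<=(Category:|Manager:|Unit Name:) (</font>)?).*?(?=(&|<))",
        RegexOptions.Singleline); // let '.' cross the line break after the Category value

    // Returns the extracted values in document order.
    public static List<string> Extract(string html)
    {
        var values = new List<string>();
        foreach (Match m in ValuePattern.Matches(html))
        {
            string value = m.Value.Trim();
            // Skip the empty matches found just before each </font>.
            if (value.Length > 0) values.Add(value);
        }
        return values;
    }
}
```

Unlike the capture-group approach, the match itself is the value here, so there is no group index to keep track of, but the label is not reported alongside it.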

RegexBuddy screenshot

answered Sep 26 '12 at 13:49 by John Woo
