How to write this Regex

Question

HTML:

<dt>
    <a href="#profile-experience" >Past</a>
</dt>
<dd>
    <ul class="past">
        <li>
            President, CEO &amp; Founder <span class="at">at</span> China Connection
        </li>
        <li>
            Professional Speaker and Trainer <span class="at">at</span> Edgemont Enterprises
        </li>
        <li>
            Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
        </li>
    </ul>
</dd>

I want to match the <li> nodes. I wrote this regex:

<dt>.+?Past+?</dt>\s+?<dd>\s+?<ul class=""past"">\s+?(?:<li>\s*?([\W\w]+?)+?\s*?</li>)+\s+?</ul>

But it does not work.

Obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 If you really have to do it, please reformat your code and reduce the problem a bit. – phimuemue, Aug 27 '10 at 05:14
Whether the input is a string or a stream, using a regex on HTML is generally a bad idea. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – Spoike, Aug 27 '10 at 07:01

score 2 · Answer 1 · answered Aug 27 '10 at 05:12

Do not parse HTML with a regex as if it were just a big pile of text. Using a DOM parser is the proper way.
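As a rough illustration of the DOM approach, here is a minimal Python sketch using the standard library's `xml.dom.minidom`. The question does not say which language is in use, so Python is an assumption, and so is the shortened `fragment` string. This only works because the snippet happens to be well-formed XML; real-world HTML usually needs a proper HTML parser.

```python
from xml.dom import minidom

# Assumed input: a shortened, well-formed copy of the list from the question.
fragment = (
    '<ul class="past">'
    '<li>President, CEO &amp; Founder <span class="at">at</span> China Connection</li>'
    '<li>Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span></li>'
    '</ul>'
)


def node_text(node):
    """Concatenate the text of a node and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


doc = minidom.parseString(fragment)

# Walk the DOM instead of pattern-matching the markup.
items = [" ".join(node_text(li).split()) for li in doc.getElementsByTagName("li")]
print(items)
```

The nested `<span>` elements and the `&amp;` entity are handled by the parser, which is exactly the part a regex tends to get wrong.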

answered Aug 27 '10 at 05:12

teukkam

score 2 · Answer 2 · edited May 23 '17 at 09:57

Don't use regular expressions to parse HTML...

edited May 23 '17 at 09:57

Community

answered Aug 27 '10 at 05:14

Alex Martelli

score 1 · Accepted Answer · answered Aug 27 '10 at 05:20

Don't use a regular expression to match an HTML document. It is better to parse it as a DOM tree using a simple state machine instead.

I'm assuming you're trying to get HTML list items. Since you haven't specified which language you use, here's a little pseudocode to get you going:

Pseudo code:

while (iterating through the text)
    if (<li> matched)
        find the position of the next </li>
        store the substring between <li> and </li> in a variable
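The loop above could be sketched in Python roughly as follows. The language, the function name `extract_list_items`, and the `sample` string are all assumptions for illustration. It also assumes `<li>` carries no attributes and that lists are not nested, which is where this kind of scanning stops being reliable.

```python
def extract_list_items(html):
    """Return the raw markup found between each <li> and the next </li>."""
    items = []
    pos = 0
    while True:
        start = html.find("<li>", pos)       # next opening tag
        if start == -1:
            break
        start += len("<li>")
        end = html.find("</li>", start)      # closing tag that follows it
        if end == -1:
            break
        items.append(html[start:end].strip())
        pos = end + len("</li>")             # continue after this item
    return items


# Assumed input: a shortened copy of the list from the question.
sample = """<ul class="past">
    <li>
        President, CEO &amp; Founder <span class="at">at</span> China Connection
    </li>
    <li>
        Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
    </li>
</ul>"""

for item in extract_list_items(sample):
    print(item)
```

Note that the result still contains the inner `<span>` tags and the `&amp;` entity; stripping those is a separate step.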

There are of course numerous third-party libraries that do this sort of thing. Depending on your development environment, you might have a function that does this already (e.g. JavaScript).

score 1 · Answer 4 · edited Nov 13 '14 at 07:46

Which language do you use?

If you use Python, you should try lxml: http://lxml.de. With lxml, you can search for the node with tag ul and class "past". You then retrieve its children, which are the li nodes, and get the text of each one.
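To show the shape of that search without requiring an install, here is a sketch using the standard library's `xml.etree.ElementTree` rather than lxml itself (lxml's `etree` module offers a largely compatible API). The `<dl>` wrapper, the shortened `fragment`, and the function name `past_positions` are assumptions, and this relies on the snippet being well-formed XML.

```python
import xml.etree.ElementTree as ET

# Assumed input: a shortened copy of the markup from the question.
fragment = """
<dt>
    <a href="#profile-experience">Past</a>
</dt>
<dd>
    <ul class="past">
        <li>
            President, CEO &amp; Founder <span class="at">at</span> China Connection
        </li>
        <li>
            Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
        </li>
    </ul>
</dd>
"""

# XML needs exactly one root element, so wrap the fragment in a <dl>.
root = ET.fromstring("<dl>" + fragment + "</dl>")


def past_positions(root):
    """Text of every <li> that is a direct child of <ul class="past">."""
    positions = []
    for li in root.findall(".//ul[@class='past']/li"):
        text = "".join(li.itertext())          # includes text inside <span>
        positions.append(" ".join(text.split()))  # collapse whitespace
    return positions


print(past_positions(root))
```

For messy real-world pages, lxml's HTML parser is the more forgiving choice; the query itself would look much the same.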

edited Nov 13 '14 at 07:46

twasbrillig

answered Aug 27 '10 at 05:22

mrcuongnv

Ok. Do it in two steps. First, extract the text inside the **ul** tags. Then extract each **li** from that text. If you use Python, the code is here: http://pastebin.com/HesVF7zJ – mrcuongnv Aug 27 '10 at 17:51
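The two steps might look something like the sketch below. This is not the pastebin code, just an assumed reconstruction of the idea; the function name `past_items` and the shortened `html` string are made up for illustration. Two things make a single pattern awkward here: without a dot-matches-newline flag, `.` will not cross the line breaks in the markup, and in Python's `re` a capture group inside a repeated group keeps only its last match.

```python
import re

# Assumed input: a shortened copy of the markup from the question.
html = """<dd>
    <ul class="past">
        <li>
            President, CEO &amp; Founder <span class="at">at</span> China Connection
        </li>
        <li>
            Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
        </li>
    </ul>
</dd>"""


def past_items(html):
    # Step 1: isolate the body of <ul class="past"> ... </ul>.
    ul = re.search(r'<ul class="past">(.*?)</ul>', html, re.DOTALL)
    if ul is None:
        return []
    # Step 2: pull out each <li> body on its own, so no match is lost.
    return [m.strip() for m in re.findall(r"<li>(.*?)</li>", ul.group(1), re.DOTALL)]


for item in past_items(html):
    print(item)
```

This still breaks on nested lists or on `<li>` tags with attributes, which is why the other answers point to a parser.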

score 0 · Answer 5 · answered Aug 27 '10 at 05:24

If you are trying to extract from or manipulate this HTML, XPath, XSL, or CSS selectors in jQuery might be easier and more maintainable than a regex. What exactly is your goal, and in what framework are you operating?

answered Aug 27 '10 at 05:24

Peter DeWeese

score -1 · Answer 6 · answered Aug 27 '10 at 05:16

please learn to use jQuery for this sort of thing

answered Aug 27 '10 at 05:16

Scott Evernden

I don't see any suggestion in that question that JavaScript is being used, and even if there was, "use jQuery" is a rubbish answer which would need to be more specific. – Quentin Aug 27 '10 at 05:19
hmmmm .. rubbish eh ? .. fascinating – Scott Evernden Aug 27 '10 at 05:20
"My engine is giving off steam!" "Use a spanner". – Quentin Aug 27 '10 at 05:33
Please -- you are kidding me. He asked exactly 'I want match the <li> node.' .. that's precisely what jQuery is designed to do . . match nodes. Look at all the other answers indicating he should process the DOM rather than use a regex. What's jQuery designed for eh??? – Scott Evernden Aug 27 '10 at 06:01
