-2

HTML:

<dt>
    <a href="#profile-experience" >Past</a>
</dt>
<dd>
    <ul class="past">
        <li>
            President, CEO &amp; Founder <span class="at">at</span> China Connection
        </li>
        <li>
            Professional Speaker and Trainer <span class="at">at</span> Edgemont Enterprises
        </li>
        <li>
            Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
        </li>
    </ul>
</dd>

I want to match the <li> nodes. I wrote this regex:

<dt>.+?Past+?</dt>\s+?<dd>\s+?<ul class=""past"">\s+?(?:<li>\s*?([\W\w]+?)+?\s*?</li>)+\s+?</ul>

But it does not work.

Scott Evernden
Dreampuf
  • 1
    please specify briefly what you want to do exactly? – Maulik Vora Aug 27 '10 at 05:13
  • 2
    Obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 If you really have to do it, you should please reformat your code and reduce problem a bit. – phimuemue Aug 27 '10 at 05:14
  • It's just a string, in .NET/C#. – Dreampuf Aug 27 '10 at 06:28
  • 1
    Even if the input is a string or a stream the regex for html is generally a bad idea. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – Spoike Aug 27 '10 at 07:01

6 Answers

2

Do not parse HTML using a regex as if it were just a big pile of text. Using a DOM parser is the proper way.

teukkam
2

Don't use regular expressions to parse HTML...

Community
Alex Martelli
1

Don't use a regular expression to match an html document. It is better to parse it as a DOM tree using a simple state machine instead.

I'm assuming you're trying to get the HTML list items. Since you haven't specified which language you use, here's a little pseudo code to get you going:

Pseudo code:

while (iterating through the text)

    if (<li> matched)

        find the position of the next </li>
        put the substring between <li> and </li> into a variable

There are of course numerous third-party libraries that do this sort of thing. Depending on your development environment, you might have a function that does this already (e.g. javascript).
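
The pseudo code above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `extract_list_items` and the `sample` string are made up for this example, and the scan only recognises a bare `<li>` tag (no attributes, no nested lists), so it is not a general HTML parser.

```python
def extract_list_items(html):
    """Return the raw markup between each <li> and the next </li>."""
    items = []
    pos = 0
    while True:
        # Look for the next opening tag, starting where the last item ended.
        start = html.find("<li>", pos)
        if start == -1:
            break
        start += len("<li>")
        # Find the position of the matching closing tag.
        end = html.find("</li>", start)
        if end == -1:
            break
        # Keep the substring between <li> and </li>, without padding.
        items.append(html[start:end].strip())
        pos = end + len("</li>")
    return items


# Hypothetical sample, shortened from the markup in the question.
sample = """<ul class="past">
    <li>
        President, CEO &amp; Founder <span class="at">at</span> China Connection
    </li>
    <li>
        Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
    </li>
</ul>"""

for item in extract_list_items(sample):
    print(item)
```

Note that the extracted strings still contain the inner `<span>` tags and the `&amp;` entity, which is one reason a real parser is usually the better choice.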

Spoike
1

Which language do you use?

If you use Python, you should try lxml: http://lxml.de. With lxml, you can search for the ul node with class "past", retrieve its children (the li nodes), and get the text of those nodes.
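
As a rough sketch of the same idea (find the ul with class "past", then read the text of its li children), here is a version that uses only the standard library's `html.parser` rather than lxml. That swap is an assumption made to avoid a third-party dependency. The names `PastItemsParser` and `past_items` are hypothetical, and nested lists are not handled.

```python
from html.parser import HTMLParser


class PastItemsParser(HTMLParser):
    """Collect the text of each <li> inside <ul class="past">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_past_list = False
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and dict(attrs).get("class") == "past":
            self.in_past_list = True
        elif tag == "li" and self.in_past_list:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_past_list = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Text inside nested tags such as <span> also arrives here.
        if self.in_item:
            self.items[-1] += data


def past_items(html):
    parser = PastItemsParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace left over from the markup's indentation.
    return [" ".join(text.split()) for text in parser.items]


sample = """<ul class="past">
    <li>
        President, CEO &amp; Founder <span class="at">at</span> China Connection
    </li>
    <li>
        Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
    </li>
</ul>"""

print(past_items(sample))
```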

twasbrillig
mrcuongnv
  • Ok. You should do 2 steps. First, you extract the text inside tags **ul**. Then, you extract **li**. If you use Python, the code is here: http://pastebin.com/HesVF7zJ – mrcuongnv Aug 27 '10 at 17:51
0

If you are trying to extract from or manipulate this HTML, XPath, XSL, or CSS selectors in jQuery might be easier and more maintainable than a regex. What exactly is your goal, and in what framework are you operating?
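
To give a feel for the XPath route, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. It assumes the fragment is well-formed XML, which real-world HTML often is not. The wrapping `<dl>` root element and the shortened `snippet` are assumptions added for this example.

```python
import xml.etree.ElementTree as ET

# Shortened from the question's markup; the <dl> root is added here
# because an XML parser needs a single root element.
snippet = """<dl>
<dt><a href="#profile-experience">Past</a></dt>
<dd>
    <ul class="past">
        <li>President, CEO &amp; Founder <span class="at">at</span> China Connection</li>
        <li>Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span></li>
    </ul>
</dd>
</dl>"""

root = ET.fromstring(snippet)

# Select every <li> that is a direct child of <ul class="past">.
items = [
    " ".join("".join(li.itertext()).split())
    for li in root.findall(".//ul[@class='past']/li")
]
print(items)
```

The selector states the intent ("li children of the ul with class past") directly, which is the maintainability argument made above.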

Peter DeWeese
-1

please learn to use jQuery for this sort of thing

Scott Evernden
  • 1
    I don't see any suggestion in that question that JavaScript is being used, and even if there was, "use jQuery" is a rubbish answer which would need to be more specific. – Quentin Aug 27 '10 at 05:19
  • 1
    hmmmm .. rubbish eh ? .. fascinating – Scott Evernden Aug 27 '10 at 05:20
  • "My engine is giving off steam!" "Use a spanner". – Quentin Aug 27 '10 at 05:33
  • 1
    Please -- you are kidding me. he asked exactly 'I want match the <li> node.' .. that's precisely what jQuery is designed to do . . match nodes. Look at all the other answers indicating he should process the DOM rather than use a regex. What's jQuery designed for eh??? – Scott Evernden Aug 27 '10 at 06:01