I have a piece of HTML like the following:

<td width="24%"><b>Something</b></td>
          <td width="1%"></td>
          <td width="46%" align="center">
           <p><b>
    needed
  value</b></p>
          </td>
          <td width="28%" align="center">
            &nbsp;</td>
        </tr>

What is a good regex pattern to extract the first text node (the text itself, not the tags) that follows the word Something? That is, I want to extract

     needed
  value

and nothing more.

I can't figure out a working regex pattern in PHP.

EDIT: I am not parsing a whole HTML document, only a few lines of it, so I want to do this with a regex and without an HTML parser.

Chris Seymour
Boris D. Teoharov
  • 3
    Don't parse HTML using a regular expression. [See this post](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) on why. – Jonah Bishop Oct 04 '12 at 17:25
  • 2
    Please refrain from parsing HTML with RegEx as it will [drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an [HTML parser](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) instead. – Madara's Ghost Oct 04 '12 at 17:26
  • 1
    Thank you for the answers. I know regex is not the proper way to do it, but the only thing I am parsing is "needed value", so I think using an HTML parser is overkill for this task. – Boris D. Teoharov Oct 04 '12 at 17:29

1 Answer

Ignoring potential issues parsing HTML with regex, the following pattern should match your example code:

Something(?:(?:<[^>]+>)|\s)*([\w\s*]+)

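This matches Something, followed by any run of HTML tags and whitespace, and then captures the very next block of text: one or more word characters (\w) or whitespace characters (\s). Note that the * inside the character class is a literal asterisk, so asterisks are captured as well.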

You can use this with PHP's preg_match() function like this:

if (preg_match('/Something(?:(?:<[^>]+>)|\s)*([\w\s*]+)/', $inputString, $match)) {
    $matchedValue = $match[1];
    // do whatever you need
}
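
As a rough, self-contained sketch, here is the same call run against the HTML from the question. The variable names $html and $value and the whitespace-collapsing step are assumptions added for illustration; they are not part of the pattern itself.

```php
<?php
// Sample input copied from the question.
$html = <<<'EOT'
<td width="24%"><b>Something</b></td>
          <td width="1%"></td>
          <td width="46%" align="center">
           <p><b>
    needed
  value</b></p>
          </td>
          <td width="28%" align="center">
            &nbsp;</td>
        </tr>
EOT;

$value = null;
if (preg_match('/Something(?:(?:<[^>]+>)|\s)*([\w\s*]+)/', $html, $match)) {
    // $match[1] still contains the line break and indentation between the two words.
    // Collapse every run of whitespace into one space (an assumed clean-up step).
    $value = trim(preg_replace('/\s+/', ' ', $match[1]));
}

var_dump($value); // string(12) "needed value"
```

If the raw text including the original line break is what you need, use $match[1] directly and skip the clean-up.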

Regex Explained:

Something         # has to start with 'Something'
(?:               # non-capturing group
    (?:           # non-capturing group
        <[^>]+>   # any HTML tags, <...>
    )
    | \s          # OR whitespace
)*                # this group can match 0+ times
(                 # capturing group 1
    [\w\s*]+      # word characters, whitespace, or a literal '*'
)
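
The commented layout above can also be used as-is in PHP by adding the x (extended) modifier, which makes PCRE ignore whitespace and # comments inside the pattern. The sketch below assumes ~ as the delimiter (so that no comment text can end the pattern early) and uses a shortened, hypothetical $input string.

```php
<?php
// Same pattern as above, written with the PCRE "x" modifier so that
// whitespace and "#" comments inside the pattern are ignored by the engine.
$pattern = '~
    Something         # has to start with the literal word
    (?:               # non-capturing group
        (?:           # non-capturing group
            <[^>]+>   # any HTML tag
        )
        | \s          # OR a whitespace character
    )*                # this group can match 0+ times
    (                 # capturing group 1
        [\w\s*]+      # word characters, whitespace, or a literal asterisk
    )
~x';

// Hypothetical, shortened input for illustration.
$input = "<b>Something</b></td>\n<td><p><b>\n  needed\n  value</b></p>";

if (preg_match($pattern, $input, $match)) {
    // Group 1 holds the first text that follows the tags and whitespace.
    var_dump($match[1]);
}
```

Keeping the comments inside the pattern means the explanation and the code cannot drift apart, at the cost of a longer string literal.
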
newfurniturey