1

I'm using the preg_match function in PHP to extract some values from an RSS feed. The feed content contains something like this:

<li><strong>Something:</strong> A text with non alphanumeric characters (more text), more text with non alphanumeric characters (more text)</li>

I need to extract "A text with non alphanumeric characters" and "more text with non alphanumeric characters" (without the parts in parentheses) and save them in a database. I don't know whether regular expressions are the best way to do this.

Thank you so much.

WedgeSparda
  • What's the reason for stripping out those chars? And what chars are they? – MatCarey Jun 11 '12 at 12:01
  • The best way to do this would be to use a PHP RSS parser rather than regex. Some guidance: http://stackoverflow.com/questions/250679/best-way-to-parse-rss-atom-feeds-with-php – Matthew Riches Jun 11 '12 at 12:01

3 Answers

1

If you want to use regex (quick and dirty, and not very maintainable), this will give you the text:

$input = '<li><strong>Something:</strong> A text with non alphanumeric characters (more text), more text with non alphanumeric characters (more text)</li>';

// Match between tags
preg_match("#</strong>(.*?)</li>#", $input, $matches);
// Remove the text inside brackets
echo trim(preg_replace("#\s*\(.*?\)\s*#", '', $matches[1]));

Note that this may fail with nested parentheses.
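If nested parentheses do turn up, one possible workaround is a recursive PCRE pattern. The following is only a sketch: the input string is a made-up example, and it assumes the parentheses are always balanced.

```php
<?php
// Hypothetical input with nested parentheses, for illustration only
$text = 'A text (more (nested) text), more text (x)';

// Group 1 matches one balanced pair of parentheses;
// (?1) recurses into group 1 to handle any nested pairs.
$pattern = '~\s*(\((?:[^()]++|(?1))*\))~';

// Remove each balanced group together with the whitespace before it
$clean = trim(preg_replace($pattern, '', $text));
echo $clean, PHP_EOL;
```

Unbalanced parentheses are simply left in place by this pattern, so it is no substitute for validating the input.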

Jay
  • I don't have enough reputation to comment on other answers, but beware that buckley's regex won't work unless the text contains exactly one comma (as they have said, though it might not be clear). – Jay Jun 11 '12 at 12:05
0

Given that the structure is always the same, you can use this regex:

</strong>([^,]*),([^<]*)</li>

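Group 1 will contain the first fragment and group 2 the second. Both groups still include the text in parentheses, so strip that separately if you don't want it.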
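As a rough sketch of how that pattern might be used from PHP, assuming the sample line from the question and exactly one comma in the text (the variable names and the extra clean-up step are illustrative additions, not part of the regex above):

```php
<?php
$input = '<li><strong>Something:</strong> A text with non alphanumeric characters (more text), more text with non alphanumeric characters (more text)</li>';

$first = null;
$second = null;

// Same pattern as above, wrapped in ~ delimiters for preg_match
if (preg_match('~</strong>([^,]*),([^<]*)</li>~', $input, $m)) {
    // Drop the parenthesised text, then trim surrounding whitespace
    $first  = trim(preg_replace('~\s*\([^)]*\)~', '', $m[1]));
    $second = trim(preg_replace('~\s*\([^)]*\)~', '', $m[2]));
}

var_dump($first, $second);
```

If the text has no comma, or more than one, the match either fails or splits in the wrong place, so checking the return value of preg_match matters here.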

Once you start parsing HTML/XML with regexes, it quickly becomes apparent that a full-blown parser is better suited. For a small or throwaway solution, a regex can be useful.
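For comparison, here is a minimal sketch of the parser route using PHP's built-in DOMDocument. It assumes the fragment looks like the sample in the question and that the wanted text sits directly inside the li element, outside the strong tag. A regex is still used for the parentheses, but not for the markup.

```php
<?php
$html = '<li><strong>Something:</strong> A text with non alphanumeric characters (more text), more text with non alphanumeric characters (more text)</li>';

// Let libxml collect warnings about the bare fragment instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$values = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    // Collect only the text nodes directly under <li>, skipping <strong>
    $text = '';
    foreach ($li->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    // Strip the parenthesised parts, then split on commas
    $text = preg_replace('~\s*\([^)]*\)~', '', $text);
    foreach (explode(',', $text) as $part) {
        $values[] = trim($part);
    }
}

print_r($values);
```

This copes with attribute changes or extra whitespace in the markup, which the pure regex versions do not, at the cost of a little more code.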

buckley
0
$str = '<li><strong>Something:</strong> A text with non alphanumeric characters (more text), more text with non alphanumeric characters (more text)</li>';
$str = preg_replace('~^.*?</strong>~', '', $str); // Remove leading markup
$str = preg_replace('~</li>$~', '', $str); // Remove trailing markup
$str = preg_replace('~\([^)]++\)~', '', $str); // Remove text within parentheses
$str = trim($str); // Clean up whitespace
$arr = preg_split('~\s*,\s*~', $str); // Split on the comma
Geert