Extracting HTML alt text with regex?

Question

I'm writing an importer for PHPbb to Discourse, using Ruby.

All over the PHPbb database are strings like

<!-- s:( --><img src="{SMILIES_PATH}/rice_frown.png" alt=":(" title="Frown" /><!-- s:( -->
<!-- s:'( --><img src="{SMILIES_PATH}/rice_crying.png" alt=":'(" title="Crying" /><!-- s:'( -->

I need to replace the string with the symbols in the alt attribute, so for the above I need :( and :'(. I'm substituting other things with regexes but I can't get the right pattern for this.

score 2 · Accepted Answer · edited May 23 '17 at 10:25

As people are always quick to point out, you can't completely parse HTML with regex. However, that doesn't mean you can't do useful things with HTML and regex. In your case, it's not a particularly hard problem. Try this:

<img .*?alt="(.*?)".*?>

And just replace those matches with the first group:

input.gsub(/<img .*?alt="(.*?)".*?>/i, '\1')
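
Note that the substitution above only swaps out the <img> tag itself, so the surrounding <!-- s:( --> comments stay in the text. If you want the whole wrapper gone, something along these lines might work. This is only a sketch based on the two sample strings in the question; SMILEY_WRAPPER and replace_smilies are names I made up, and it assumes the opening and closing comments are always identical and sit directly against the tag.

```ruby
# Hypothetical pattern: matches the opening comment, the <img> tag and the
# identical closing comment. Group 1 is the text after "s" in the comment,
# group 2 is the alt value.
SMILEY_WRAPPER = /<!-- s(.+?) --><img [^>]*?alt="([^"]*)"[^>]*><!-- s\1 -->/

# Hypothetical helper: replaces every smiley wrapper with its alt text.
# The block form avoids backslash handling in a replacement string.
def replace_smilies(text)
  text.gsub(SMILEY_WRAPPER) { Regexp.last_match(2) }
end

# Sample input built from the two strings in the question.
post = <<~'POST'.chomp
  Oh no <!-- s:( --><img src="{SMILIES_PATH}/rice_frown.png" alt=":(" title="Frown" /><!-- s:( --> and <!-- s:'( --><img src="{SMILIES_PATH}/rice_crying.png" alt=":'(" title="Crying" /><!-- s:'( -->
POST

puts replace_smilies(post)
# prints: Oh no :( and :'(
```

Text without any smiley markup passes through unchanged, so it should be safe to run over every post body.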

If you really want to be SUPER ROBUST, you can doll that regex up a little:

input.gsub(/<\s*img .*?alt\s*=\s*(["'])(.*?)\1.*?>/i, '\2')

That handles the following variations (note whitespace, type of quotation mark, and capitalization):

< img alt="foo" />
<IMG alt="foo" />
<img alt = "foo" />
<img alt='foo' />

And so on....
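
A quick way to check those variations is to run each one through the pattern, with the replacement string set to '\2' so that only the alt value is left. ROBUST_IMG and samples are just illustrative names, not anything from phpBB or Discourse.

```ruby
# The more tolerant pattern: optional whitespace, either quote style,
# case-insensitive. Group 1 is the quote character, group 2 the alt value.
ROBUST_IMG = /<\s*img .*?alt\s*=\s*(["'])(.*?)\1.*?>/i

# The four variations listed above.
samples = [
  '< img alt="foo" />',
  '<IMG alt="foo" />',
  '<img alt = "foo" />',
  "<img alt='foo' />"
]

# Replace each tag with the contents of its alt attribute.
results = samples.map { |html| html.gsub(ROBUST_IMG, '\2') }
results.each { |r| puts r }
# prints "foo" four times
```

One thing to keep in mind: without the m flag, . does not match a newline in Ruby, so a tag that is split across lines will not be matched by this pattern.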

rewritten · Answer 2 · 2013-11-15T22:16:42.380

2

There are boatloads of libraries that let you load HTML. The best known is Nokogiri, with which you could do:

require 'nokogiri'

string = '<!-- s:( --><img src="{SMILIES_PATH}/rice_frown.png" alt=":(" title="Frown" /><!-- s:( -->'
# Parse the string and read the alt attribute of the first <img> element
alt_str = Nokogiri::HTML(string).css("img").first["alt"]

edited Nov 15 '13 at 22:16

answered Nov 15 '13 at 21:40

True, but why use a library when one regex will do? :) I'll keep nokogiri in mind though, thanks. – rikkit Nov 15 '13 at 22:44
Why indeed. What happens *when* the HTML changes and the regex breaks? Unless you own the HTML and can ensure it won't change, you have to plan on maintaining the pattern. A parser such as Nokogiri mitigates the problem by breaking the content down into something much more resilient. Yeah, regexes are cool, but they're not made for HTML. You can get them to work, but the result is fragile. There are times we have to preprocess HTML before passing it to a parser, to fix pathologically damaged markup; otherwise, try the parser first. It's really worth going that route by default. – the Tin Man Nov 16 '13 at 00:07
I get that, but in this specific case I don't need the flexibility. PHPbb3 only stores smilies in this weird format. – rikkit Nov 16 '13 at 00:48
You don't need the flexibility. But using the library you get the result you want without thinking about regexps. – rewritten Nov 16 '13 at 15:40
