I'm trying to select some text using regular expressions leaving all img tags intact.

I've found the following code that selects all img tags:

/<img[^>]+>/g

But given a text like:

This is an untagged text.
<p>this is my paragraph text</p>
<img src="http://example.com/image.png" alt=""/>
<a href="http://example.com/">this is a link</a>

the code above selects only the img tag:

/<img[^>]+>/g #--> using this code will result in:
<img src="http://example.com/image.png" alt=""/>

But I would like a regex that selects everything except the image, like:

/magical regex/g # --> results in:
This is an untagged text.
<p>this is my paragraph text</p>
<a href="http://example.com/">this is a link</a>

I've also found this code:

/<(?!img)[^>]+>/g

which selects all tags except the img one. But in some cases I will have untagged text or text between tags, so this won't work for my case. :(

Is there any way to do it? Sorry, I'm really new to regular expressions and have been struggling for a few days trying to make this work.

Thanks in advance


UPDATE:

OK, for those who think I want to parse the HTML: I don't. I just want to select text.

Another thing: I'm not using any specific language. I'm using Yahoo Pipes, which only provides regex and some string tools to accomplish the job, so it doesn't involve any programming code.

For reference, here is how the Regex module works in Yahoo Pipes:

http://pipes.yahoo.com/pipes/docs?doc=operators#Regex


UPDATE 2

Fortunately, I've been able to strip the text around the img tag, but only step by step, as @Blixt recommended:

<(?!img)[^>]+> , replace with "" #-> strips out every tag that is not img
(?s)^[^<]*(.*), replace with $1  #-> removes all the text before the img tag
(?s)^([^>]+>).*, replace with $1 #-> removes all the text after the img tag

The problem with this is that it only catches the first img tag; I would then have to handle the others manually by hard-coding them, so I'm still not sure this is the best solution.
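
One possible way around the first-tag limitation is a single global replace that either captures an img tag or consumes one other character, and then substitutes `$1`. The sketch below is in JavaScript only to make the idea testable; the helper name `keepOnlyImgTags` and the second image URL are made up for illustration, and whether the Yahoo Pipes Regex module expands an unmatched `$1` to an empty string the same way is an assumption that would need checking.

```javascript
// Hypothetical helper: keeps every <img ...> tag and drops all other text.
function keepOnlyImgTags(html) {
  // Alternative 1 captures a whole img tag in group 1.
  // Alternative 2 consumes any single other character (including newlines).
  // In JavaScript an unmatched group expands to "", so "$1" keeps only the tags.
  return html.replace(/(<img[^>]+>)|[\s\S]/g, '$1');
}

// Sample text from the question, plus a second (made-up) image
// to show that more than one tag is kept.
const sampleHtml = [
  'This is an untagged text.',
  '<p>this is my paragraph text</p>',
  '<img src="http://example.com/image.png" alt=""/>',
  '<a href="http://example.com/">this is a link</a>',
  '<img src="http://example.com/second.png" alt=""/>',
].join('\n');

console.log(keepOnlyImgTags(sampleHtml));
```

Because the img alternative is tried first at every position, the single-character fallback never eats part of an img tag. The tags come out back to back with no separator, since the newlines between them are consumed as well.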

zanona
  • Arggghhhhh! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Quentin Dec 05 '10 at 12:08
  • What language are you using, Javascript, PHP? – Orbling Dec 05 '10 at 12:09
  • @David: God I hate this constant anti-regex on this site for HTML. You can not *parse* HTML with regex, but tasks like this can be accomplished simply. He is not *parsing* it. – Orbling Dec 05 '10 at 12:10
  • I don't see how this use-case is any better. HTML is not regular, so why insist on using the wrong tool for the job? It eludes me. – Jim Brissom Dec 05 '10 at 12:16
  • thanks @Orbling, that's right, I really don't want to parse it, I just want to select all text except `<img[^>]+>`; this is simply text selection, nothing else. – zanona Dec 05 '10 at 12:19
  • @Jim Because the "right" tool is a) not always available (HTML is not XML, and HTML parsing is non-trivial), b) is vastly slower, c) is overkill if you can achieve what you want with a pattern match. – Orbling Dec 05 '10 at 12:27
  • @Orbling now come the newlines. :) –  Dec 05 '10 at 14:01
  • @Time Machine `\s` covers newlines in multi-line mode, as does the `[^>]+` – Orbling Dec 05 '10 at 14:05
  • @ludico I think your problem is really "How do I remove elements in Yahoo pipes?" Yahoo pipes has a whole range of tools of which regex is only one. I am not a Yahoo-pipe expert but a quick glance suggests that there are tools which will do what you want quickly and efficiently and are easy to learn and use. Part of the value of SO is that people will try to give you the answer you actually need rather than what you ask. Defining your requirements as fully as possible always helps – peter.murray.rust Dec 05 '10 at 14:05
  • @Orbling and comments. –  Dec 05 '10 at 14:05
  • @Time Machine You can get rid of that issue with lookahead and lookbehind, though it'd be easier just to strip the image tags/process them within the comments, usually that would not cause an issue in most use-cases. – Orbling Dec 05 '10 at 14:08
  • @Orbling I actually mean <img ... –  Dec 05 '10 at 14:10
  • @peter Yes he should have defined his question accurately to begin with, my answer was before Yahoo Pipes was mentioned. Having said that, Yahoo Pipes Regex is capable I believe. I think that whilst SO can offer alternatives to the approach requested by a question, it should not enforce it if possible in the way they have asked. If there is an another way, it should be presented with advantages. I know you did that in your answer. – Orbling Dec 05 '10 at 14:10
  • @Time Machine: Still can be done with a negative lookahead after the first `<` - perl-regexp are *very* powerful. Most people do not have a clue about the advanced features sadly. Incidentally, I think comments inside the actual tag are invalid are they not? – Orbling Dec 05 '10 at 14:11

2 Answers

The regexp you have to find the image tags can be used with a replace to get what you want.

Assuming you are using PHP:

// preg_replace() already replaces every match; PCRE has no "g" modifier.
$htmlWithoutIMG = preg_replace('/<img[^>]+>/', '', $html);

If you are using Javascript:

var htmlWithoutIMG = html.replace(/<img[^>]+>/g, '');

This takes your text, finds the <img> tags and replaces them with nothing, i.e. it deletes them from the text, leaving what you want. The < and > characters have no special meaning in these patterns, so they do not need escaping.

Orbling
  • thanks for this @Orbling, sorry if I've expressed myself the wrong way. I think I need to select all the text except the part in the `img` tag, because I want to do what you've mentioned: replace all the non-'img' text with an empty string, which will leave me with only the images. My target in this case would be the images and not the text itself :) thanks – zanona Dec 05 '10 at 12:26
  • So you want all the images, but not the text, the inverse of this? This will return the text without the images, which is what it still sounds like you are asking for. – Orbling Dec 05 '10 at 12:34
  • @ludicco You need to switch on global matching the `g` option, both of my examples have it on, see: http://stackoverflow.com/questions/360492/regular-expression-on-yahoo-pipes – Orbling Dec 05 '10 at 14:06

Regular expression matches have a single start and length. This means the result you want is impossible in a single match (since you want the result to end at one point, then continue later).

The closest you can get is to use a regular expression that matches everything from start of string up to start of <img> tag, everything between <img> tags and everything from end of <img> tag to end of string. Then you could get all matches from that regular expression (in your example, there would be two matches).

The above answer is assuming you can't modify the result. If you can modify the result, simply replace the <img> tags with the empty string to get your result.
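
As a rough illustration of the "several matches" idea, the JavaScript sketch below splits the text on the img pattern from the question, which yields the pieces that lie before, between, and after the tags. The helper name `textOutsideImgTags` is hypothetical, and dropping whitespace-only pieces is a choice made here for readability rather than part of the approach described above.

```javascript
// Hypothetical helper: returns the chunks of text that lie outside <img> tags.
function textOutsideImgTags(html) {
  return html
    .split(/<img[^>]+>/)                    // cut the text at every img tag
    .filter((chunk) => chunk.trim() !== ''); // drop whitespace-only pieces
}

// Sample text from the question.
const pageHtml = [
  'This is an untagged text.',
  '<p>this is my paragraph text</p>',
  '<img src="http://example.com/image.png" alt=""/>',
  '<a href="http://example.com/">this is a link</a>',
].join('\n');

const chunks = textOutsideImgTags(pageHtml);
console.log(chunks.length); // 2 pieces for this sample
console.log(chunks.join(''));
```

For the sample text this gives two pieces, one on each side of the single img tag, which matches the two matches mentioned above.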

Blixt