Only replace keywords that are not inside of an anchor:

// replace
... keyword ...  -> ... <a href="url">keyword</a> ...

// not replace
...<a href=""> ... keyword ... </a>...  -> ...<a href=""> ... keyword ... </a>...

Please provide a suitable pattern to accomplish this.

Note: I am working with a string variable, not an HTML document!


EDIT: Ok, Ok. I'll use an HTML parser, thanks!

Igor Parra
  • A regex is not suitable for this. Use an HTML parser so you can easily access text nodes that are not inside an `<a>` tag. – ThiefMaster May 31 '12 at 11:58
  • I refer you to this answer: http://stackoverflow.com/a/1732454/944982 – LexyStardust May 31 '12 at 12:00
  • I am working with a string variable, not an HTML document. – Igor Parra May 31 '12 at 12:01
  • You're working over something that's formatted like an HTML/XML document. The theorems that show that you can't parse HTML with regexes also apply to this case. It doesn't matter whether the string you're manipulating comes from an actual website, a variable, or aliens from outer space, or whether it mysteriously appeared on a piece of toast -- you _can't do this with regexes._ – Louis Wasserman May 31 '12 at 12:02
  • OK got it, thanks. **No need to downvote. Why so anxious?** – Igor Parra May 31 '12 at 12:04
  • Notwithstanding SO going bonkers over HTML and regex again, that's easy to accomplish with some lookaround pattern matching, since keywords ought to be tightly enclosed by their specific link tags in your case. We have a few duplicates on this... – mario May 31 '12 at 12:04
  • @mario - but the point stands that it would be neater to do this with something designed to parse HTML, right? – LexyStardust May 31 '12 at 12:05
  • @NomikOS: I didn't downvote, but I suspect the downvotes don't relate to your wanting to use regular expressions; I expect they relate to your not having presented your own attempt, shown your own work. Your question reads a bit like "please do this for me," which is something that tends to get downvoted here. – T.J. Crowder May 31 '12 at 12:07
  • @t-j-crowder No, no. We can never know who downvoted, I know; it's just that a lot of users seem to enjoy downvoting at the first smell of blood, and I hate that. As for showing code, that is precisely the reason to ask here. – Igor Parra May 31 '12 at 12:11
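
The lookaround idea mario describes in the comments above could be sketched in Java roughly as follows. This is a minimal sketch and was not posted in the thread: the class name `LookaroundLinker`, the method `link`, and its parameters are hypothetical, and it assumes every existing anchor contains only plain text (the "tightly enclosed" case) and that the keyword starts and ends with a word character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundLinker {
    // Hypothetical helper: wraps each whole-word occurrence of keyword in a link,
    // unless the next tag after it is </a> or the keyword sits inside a tag.
    static String link(String html, String keyword, String url) {
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(keyword) + "\\b"
                + "(?![^<]*</a>)"   // skip when the next tag is a closing </a>
                + "(?![^<>]*>)");   // skip when we are inside a tag, e.g. an attribute value
        // quoteReplacement keeps '$' and '\' in the URL from being read as group references.
        String replacement = "<a href=\"" + Matcher.quoteReplacement(url) + "\">$0</a>";
        return p.matcher(html).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(link(
                "see keyword and <a href=\"x\">a keyword here</a>", "keyword", "url"));
    }
}
```

The assumption matters: with markup inside the anchor, such as `<a href="x"><b>keyword</b></a>`, the first tag after the keyword is `</b>` rather than `</a>`, so the keyword gets wrapped anyway. That is the kind of fragility the other commenters are warning about.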

1 Answer

Regular expressions can't reliably be used to do this sort of thing, because HTML is not a regular language. If you use a parser like JSoup to process your string variable into a DOM, then serialize the result back into a string, you can get a reliable result.
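
JSoup is a third-party library, so to stay runnable with only the JDK, the sketch below hand-rolls the one thing a parser gives you here: knowing whether a piece of text sits inside an `<a>` element, however deeply nested. It is an illustration under stated assumptions, not JSoup's API and not a substitute for it. The names `AnchorAwareLinker` and `link` are hypothetical, and the tag matcher is deliberately naive: it assumes no `>` inside attribute values and ignores comments, scripts, and CDATA, all of which a real parser handles.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorAwareLinker {
    // Naive tag matcher: '<', optional '/', a tag name, then everything up to the next '>'.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Hypothetical helper: links keyword only in text that is outside every <a> element.
    static String link(String html, String keyword, String url) {
        Pattern word = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
        String replacement = "<a href=\"" + Matcher.quoteReplacement(url) + "\">$0</a>";
        StringBuilder out = new StringBuilder();
        Matcher tag = TAG.matcher(html);
        int pos = 0;          // end of the previously consumed tag
        int anchorDepth = 0;  // number of <a> elements currently open
        while (tag.find()) {
            appendText(out, html.substring(pos, tag.start()), anchorDepth, word, replacement);
            out.append(tag.group()); // tags are copied through untouched
            if (tag.group(2).equalsIgnoreCase("a")) {
                boolean closing = !tag.group(1).isEmpty();
                anchorDepth = closing ? Math.max(0, anchorDepth - 1) : anchorDepth + 1;
            }
            pos = tag.end();
        }
        appendText(out, html.substring(pos), anchorDepth, word, replacement);
        return out.toString();
    }

    // Rewrites the text only when no anchor is open; otherwise copies it as-is.
    private static void appendText(StringBuilder out, String text, int anchorDepth,
                                   Pattern word, String replacement) {
        out.append(anchorDepth == 0 ? word.matcher(text).replaceAll(replacement) : text);
    }

    public static void main(String[] args) {
        System.out.println(link(
                "<a href=\"x\"><b>keyword</b> more</a> and keyword", "keyword", "url"));
    }
}
```

Tracking depth rather than peeking at the next tag is what lets this leave a keyword alone when it is wrapped in other markup inside a link. For real input, the same walk over text nodes is better done on the DOM that a parser such as JSoup builds.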

T.J. Crowder
  • That might be a useful comment (despite the 'regular language' parroting being a bit off, of course). – mario May 31 '12 at 12:01
  • @mario: A regular language is something that can be parsed with a regular expression; it's completely on point. I think this is an answer, because it helps the OP achieve the *result* he's looking for. It doesn't do it via the *means* he thought, but lots of answers usefully go a different direction from what the OP was expecting. – T.J. Crowder May 31 '12 at 12:02
  • Meh. I've always considered "it's impossible to do what you're trying to do; here's a workaround though" answers to be totally appropriate. – Louis Wasserman May 31 '12 at 12:03
  • Regular expressions alone won't parse HTML, but that doesn't mean you can't use them at all in HTML parsing. – Esailija May 31 '12 at 12:04
  • @Esailija: True, and it wouldn't surprise me to find that JSoup and similar libraries make use of regular expressions as *part* of the parsing. If you can come up with a regex that will *reliably* do what the OP wants, without blowing up when it encounters a matching class name, or a literal `>` in the HTML text, by all means post it. :-) – T.J. Crowder May 31 '12 at 12:06
  • Modern regular expressions go beyond regular languages. It doesn't get truer from refurbishing that misunderstanding over and over again. Besides, we have enough duplicates on this topic to not warrant a shallow two-line repwhoring answer. (I still don't get how people always proclaim the overkill answer for a pattern matching problem to be on topic anyway.) – mario May 31 '12 at 12:06
  • Yeah, I know, but sometimes I see people saying this stuff even when a regular expression is not even the main parsing engine. – Esailija May 31 '12 at 12:07
  • @Esailija: False. Regular expressions alone can parse HTML; that doesn't mean it's a good idea, though. Regular expressions are never a good idea when it comes to performance and critical applications. In the background they also use the Boyer-Moore algorithm and whatnot; regular expressions just provide a way to save you from most of it. – Jeffrey Vandenborne May 31 '12 at 12:07
  • @mario: No need to be offensive, and I wouldn't expect this answer to be greatly upvoted (this ground's so been covered, though I can't find an exact duplicate). And by all means, if you can reliably achieve what the OP is requesting using Java's regular expressions, let's see it. That would be great! – T.J. Crowder May 31 '12 at 12:08
  • @JeffreyVandenborne do you have any references for that? – Esailija May 31 '12 at 12:10
  • It is very common to say `regular expressions` are evil here on SO. Please. It is one thing to ask whether it is a suitable tool for something, and another just to ask for a pattern! – Igor Parra May 31 '12 at 12:17
  • @t-j-crowder No! I am giving my opinion about the topic... Can you please suggest a solution via an HTML parser or a regex pattern? – Igor Parra May 31 '12 at 12:23
  • @NomikOS: As I said, I don't think it can be done reliably with regular expressions. I haven't used JSoup (although that'll probably change soonish), but [their cookbook](http://jsoup.org/cookbook/) seems like it would have what you need... – T.J. Crowder May 31 '12 at 12:27
  • OK, this topic has been treated enough here. I'll investigate the `parser way`. Thanks, really. – Igor Parra May 31 '12 at 12:31