I am trying to use Regular Expressions to decode some HTML I retrieve from a webpage. I want to transform some <iframe> tags into links.

As far as I can tell the code should work, and it does in some test programs. However, when I run it on my Android device it does not find any matches (whereas in the test programs it does).

The regular expression I am using is as follows (keep in mind I'm coding in Java, so I need to escape the escape character as well):

String regularExpression = "<iframe.+?src=\\\\?(\\S+).+?(><\\\\?/iframe>|\\\\?/>)";
String replacement = "<a href=$1>Youtube</a>";

input.replaceAll(regularExpression, replacement);

As far as I can tell, this should replace every <iframe> tag that has a src attribute with a hyperlink to that source. However, when I feed it the following input, nothing is replaced:

<iframe src=\"http:\/\/www.youtube.com\/embed\/s6b33PTbGxk\" frameborder=\"0\" width=\"500\" height=\"284\"><\/iframe>

The response is simply the exact same text, only with the escape characters removed:

<iframe src="http://www.youtube.com/embed/s6b33PTbGxk" frameborder="0" width="500" height="284"></iframe>

Can someone help me and explain what I'm doing wrong? I only started learning Regular Expressions yesterday, but I just can't for the life of me figure out why this doesn't work.

Lars

2 Answers

The method String.replaceAll doesn't modify the string. It can't, because strings are immutable. Instead, it returns a new string containing the result. You need to assign that result to something:

String result = input.replaceAll(regularExpression, replacement);
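
A minimal sketch of the difference, assuming the sample input in its unescaped form. The class name ReplaceAllDemo and the helper iframesToLinks are made up for illustration; the regex and replacement are the ones from the question:

```java
// Hypothetical demo class; the names here are illustrative, not from the question.
class ReplaceAllDemo {
    // Regex and replacement copied unchanged from the question.
    static final String REGEX =
            "<iframe.+?src=\\\\?(\\S+).+?(><\\\\?/iframe>|\\\\?/>)";
    static final String REPLACEMENT = "<a href=$1>Youtube</a>";

    // Returns the converted text; the argument itself is never modified.
    static String iframesToLinks(String input) {
        return input.replaceAll(REGEX, REPLACEMENT);
    }

    public static void main(String[] args) {
        // Sample input from the question, with the escape characters already removed.
        String input = "<iframe src=\"http://www.youtube.com/embed/s6b33PTbGxk\" "
                + "frameborder=\"0\" width=\"500\" height=\"284\"></iframe>";

        input.replaceAll(REGEX, REPLACEMENT);   // return value discarded: no visible effect
        System.out.println(input);              // still the original <iframe> markup

        String result = iframesToLinks(input);  // keep the returned string instead
        System.out.println(result);
    }
}
```

With this input, the second line printed should be <a href="http://www.youtube.com/embed/s6b33PTbGxk">Youtube</a>, while the first is the untouched <iframe> markup.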

Also, don't use regular expressions to parse HTML.

Mark Byers
  • Can't believe I made such a stupid mistake, thanks for pointing it out to me. As for why I'm using regex: I get fed HTML by people who can't write HTML, and it has to work on both iPhone and Android. It currently works on iPhone, but since I'm developing a new Android app, this will be my temporary solution. Thanks again for helping. – Lars Oct 25 '11 at 10:50
String resultString = subjectString.replaceAll("(?=<(iframe)\\s+src\\s*=\\s*(['\"])(.*?)\\2[^>]*>).*?</\\1>", "<a href=$3>Youtube</a>");

This should work. In addition to @Mark Byers' note, your regex does not seem to match your input, even with the (double) backslashes removed.
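
As a rough check of the pattern above, here is a small sketch that runs it against the sample input from the question, both with and without the literal backslashes. The class name LookaheadIframeDemo and the method convert are hypothetical; only the regex and replacement come from this answer:

```java
// Hypothetical demo class; the names here are illustrative only.
class LookaheadIframeDemo {
    // Pattern and replacement from the answer above, unchanged.
    static final String REGEX =
            "(?=<(iframe)\\s+src\\s*=\\s*(['\"])(.*?)\\2[^>]*>).*?</\\1>";
    static final String REPLACEMENT = "<a href=$3>Youtube</a>";

    // Returns a new string with matching <iframe> elements replaced by links.
    static String convert(String subjectString) {
        return subjectString.replaceAll(REGEX, REPLACEMENT);
    }

    public static void main(String[] args) {
        // Sample markup with the escape characters already removed.
        String unescaped = "<iframe src=\"http://www.youtube.com/embed/s6b33PTbGxk\" "
                + "frameborder=\"0\" width=\"500\" height=\"284\"></iframe>";
        // Same markup with literal backslashes, as in the raw input from the question.
        String escaped = "<iframe src=\\\"http:\\/\\/www.youtube.com\\/embed\\/s6b33PTbGxk\\\" "
                + "frameborder=\\\"0\\\" width=\\\"500\\\" height=\\\"284\\\"><\\/iframe>";

        System.out.println(convert(unescaped)); // iframe replaced by a link
        System.out.println(convert(escaped));   // no match here, so the text comes back unchanged
    }
}
```

Two things to note: group 3 excludes the surrounding quotes, so the generated href is unquoted, and the pattern expects a quote directly after src=, so the backslashes have to be stripped before it is applied.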

FailedDev