I'm trying to use replaceAll in Java to replace the HTML of an image tag.

This is my input:

String html = "&nbsp;asd<i>&nbsp;qwe qwe<u>qweqwe</u></i><u>wqeqwesd.<img alt=\"vechile\" src=\"urldirectionstring\" style=\"float:left; height:190px; width:400px\" /></u>";

What I want is to replace the entire <img ...> tag with the following:

"Image Url: urldirectionstring";

Only the img tag should be replaced; the rest of the string should stay untouched. This is what I have so far, but it's not enough:

String replaceImg = html.replaceAll("<img[^>]*/>","Image Url: "+$srcImgdirection);

As you can see, I don't know how to get urldirectionstring into the replacement as a variable.

----------- LAST EDIT -----------

I found this regex to capture urldirectionstring, but I still don't know how to replace only the tag and add the text:

String replaceImg = html.replaceAll("<img.*src=\"(.*)\"[^>]*/?>","Image Url: "+$srcImgdirection);
VGR
Alberto Acuña
  • Are you aware that there are libraries for parsing HTML properly, and that regexes are not well suited to the task? – Patrick Parker Feb 24 '17 at 13:38
  • I agree with Patrick, but for future use of `replaceAll()`: you can access the capturing groups in the replacement string via `$group_number`, e.g. `replaceAll("src=\"([^\"]*)\"","src=\"prefix$1suffix\"")` to surround the attribute content with `"prefix"` and `"suffix"`. – Thomas Feb 24 '17 at 13:41
  • However, as Patrick already pointed out, regular expressions are not a good fit for irregular languages such as HTML (e.g. what happens with nested tags?) unless you _really_ know _everything_ that is to be expected. As an example, your expression ` – Thomas Feb 24 '17 at 13:46
  • I'm using a library that generates the image element, so it always has the same style. @Thomas thanks! – Alberto Acuña Feb 24 '17 at 14:48
  • And if you should ever upgrade to a newer version of that library, are you certain it will continue to generate HTML that can be parsed that way? See also http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg – VGR Feb 24 '17 at 15:24

1 Answer

You could use:

String replaceImg = html.replaceAll(".*<img.*src=\"(.*?)\".*", "Image Url: $1");

This replaces the entire string, so the output is only Image Url: urldirectionstring. Here $1 refers to the part of the match inside the parentheses: each pair of parentheses creates a "group" that can be referenced later, and since this regex contains only one pair, it is the first group and is referenced as $1.
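
To illustrate how groups are numbered, here is a minimal sketch with two pairs of parentheses, so both $1 and $2 are available in the replacement. The class name CaptureGroupDemo and the method describeImg are made up for this example, and the pattern assumes that alt appears before src and that both values are double-quoted:

```java
class CaptureGroupDemo {
    // Hypothetical helper: group 1 captures the alt value, group 2 the src value.
    // Assumes alt comes before src and that attribute values are double-quoted.
    static String describeImg(String imgTag) {
        return imgTag.replaceAll(
                "<img[^>]*alt=\"([^\"]*)\"[^>]*src=\"([^\"]*)\"[^>]*>",
                "Image '$1' Url: $2");
    }

    public static void main(String[] args) {
        String tag = "<img alt=\"vechile\" src=\"urldirectionstring\" "
                + "style=\"float:left; height:190px; width:400px\" />";
        System.out.println(describeImg(tag));
        // prints: Image 'vechile' Url: urldirectionstring
    }
}
```

Groups are numbered by the position of their opening parenthesis, counting from the left, so the alt value is $1 and the src value is $2.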

If you want to replace only the img tag and keep the other tags intact, you could use:

String replaceImg = html.replaceAll("<img.*src=\"(.*?)\"[^>]*/?>", "Image Url: $1");

In this case, the output will be: &nbsp;asd<i>&nbsp;qwe qwe<u>qweqwe</u></i><u>wqeqwesd.Image Url: urldirectionstring</u>
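
One caveat: if the input contains more than one img tag on the same line, the greedy .* in the pattern above can run from the first <img to the last src=, swallowing everything in between. A possible variant, sketched below, uses [^>]* so that a match cannot cross the end of a tag. The class name ImgTagReplacer and the method replaceImgTags are invented for this example, and the pattern still assumes double-quoted src values and no > character inside attribute values:

```java
import java.util.regex.Pattern;

class ImgTagReplacer {
    // [^>]* cannot run past the closing '>' of a tag, so each <img> is matched separately.
    // \s before src avoids matching attributes that merely end in "src" (e.g. data-src).
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*?\\ssrc=\"([^\"]*)\"[^>]*>");

    // Hypothetical helper: replaces every <img> tag with "Image Url: " plus its src value.
    static String replaceImgTags(String html) {
        return IMG.matcher(html).replaceAll("Image Url: $1");
    }

    public static void main(String[] args) {
        String html = "<u>a<img alt=\"x\" src=\"one.png\" /></u>"
                + "<i><img src=\"two.png\" style=\"float:left\"></i>";
        System.out.println(replaceImgTags(html));
        // prints: <u>aImage Url: one.png</u><i>Image Url: two.png</i>
    }
}
```

This is still a regex over HTML, so it only holds up while the generated markup keeps the same shape; for arbitrary HTML, a real parser is the safer choice, as the comments point out.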