Remove html from a string

Question

String1:

<img alt="" src="http://abcghgds.com/justin-bieber-ferns-650-430.jpg" width="650" height="430" /> Have you seen <a href="http://www.abcdefg.com/between_two_ferns" target="_blank">**Between Two Ferns**</a>?

result1:

**Have you seen** <a   style = "display:inline" href="http://www.abcdefg.com/between_two_ferns" target="_blank">**Between Two Ferns**</a>?

I want to check whether String1 ends with String2 (shown above as result1).
If it does, I want to remove String2 from String1.
In the above case, the text in String1 does end with the text in String2 (though the HTML of the two strings is different).

The output that I want is

String1= <img alt="" src="http://abcghgds.com/justin-bieber-ferns-650-430.jpg" width="650" height="430" />

I can't simply write if(String1.endsWith(String2)){} because the HTML of the two strings is different. So I first have to strip the HTML and check whether the text in String1 ends with the text in String2. If it does, I want to modify the original String1, i.e. remove String2 from String1 without altering any of the remaining HTML in String1.

Here's what I have tried:

ans1 and ans2 are plain text, and I use them only for the comparison. In the end I need to remove string2 from string1 if string1 ends with string2, but without altering the HTML in string1. I don't want string1 reduced to plain text.

    String ans1 = Jsoup.parse(string1).text();
    String ans2 = Jsoup.parse(result1).text();

    if (ans1.endsWith(ans2)) {
        string1 = string1.replace(result1, "");
    }

Can you explain how string1 ends with string2? Your code doesn't do anything because obviously string1 doesn't end with string2. — MxLDevs, May 27 '14 at 15:05
This seems like an X/Y problem to me. What exactly are you trying to achieve by doing this? If you're trying to analyse HTML with the Java String API or even regex, you should probably be using a proper parser. You could convert your HTML to XHTML and use one of the XML parser APIs such as DOM or SAX. That would make a great deal more sense. — Rudi Kershaw, May 27 '14 at 15:20
But the text in string1 does end with the text in string2. I'm not sure whether you understand my question. — girl24, May 27 '14 at 15:20
Rudi: I have two strings, and I'm simply checking whether string1 ends with result1. In the above case the text in string1 ends with the text in result1, but the HTML associated with them is not the same. So if I directly call string1.endsWith(result1), it will never match: although the texts are similar, the HTML is different. So initially I am just checking whether the text in string1 ends with the text in result1. If yes, I want to remove the text of result1 from string1 along with its associated HTML, and keep the rest of the HTML and text in string1. — girl24, May 27 '14 at 15:23
The only purpose of using JSoup was to compare the text and detect the correct string to be replaced. then I want to replace the original string and not the text string. I hope you got my question now — girl24, May 27 '14 at 15:27
They're never going to be the same, because one contains `style = "display:inline"` and the other doesn't. Why are you trying to compare, and what exactly do you want to find equal? Is it the element's contents (as opposed to its attributes), or a combination of specific attributes (such as href)? Because then you could use Jsoup to get the contents instead of dealing with the HTML in a raw String. — Rudi Kershaw, May 27 '14 at 15:31
initially I am trying to compare the text. then, if the text are equal I just want to remove the text and the associated html from the string1 — girl24, May 27 '14 at 15:37
Okay. Well, I am not entirely familiar with Jsoup, but I'll see if I can throw together an answer using what little I know. — Rudi Kershaw, May 27 '14 at 15:43
I guess endsWith is not the right way. I'll most likely have to use regex, because with Jsoup I'm only checking whether the text is similar. Once I know the text is similar, I try to replace it, but the replacement never matches because of the HTML. So I guess I will have to try some regex. But I am looking for a generic solution, not one that works for this example only. Maybe I'll have to try removing everything after the first > encountered before the text (using regex). — girl24, May 27 '14 at 15:45

score 0 · Answer 1 · answered May 27 '14 at 15:14

You are close, but endsWith compares the whole suffix of the string.

My suggestion:

Compare only the last 10 characters, or some other number x of characters that fits your scenario.

In the example below, I compare the last 10 characters of each text using endsWith.

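    String ans1 = Jsoup.parse(string1).text();
    String ans2 = Jsoup.parse(result1).text();

    // Compare the last n characters (at most 10); n is capped so that
    // substring() cannot throw when either text is shorter than 10.
    int n = Math.min(10, Math.min(ans1.length(), ans2.length()));
    if (ans1.substring(ans1.length() - n).endsWith(ans2.substring(ans2.length() - n))) {
        string1 = string1.replace(result1, "");
    }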
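Note that replace(result1, "") only has an effect if result1 occurs verbatim in string1, which it does not when the markup differs. One way to cut the original HTML at the point where the matching text begins is to walk string1 backwards, skipping tags and whitespace, and match the plain text character by character. The following is only a rough, dependency-free sketch of that idea, not Jsoup functionality. The class and method names are made up for illustration, and it assumes there is no `>` inside attribute values or text, no HTML entities, and that whitespace can be ignored when comparing. If the matched text begins inside an element, that element's opening tag is left behind. A real parser is more robust.

```java
class TrailingHtmlTrimmer {

    /** Crude visible text of a fragment: tags dropped, whitespace collapsed (assumption: no entities). */
    static String visibleText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /**
     * If the visible text of html ends with suffixText, returns html cut off
     * where that text begins; otherwise returns html unchanged.
     */
    static String removeTrailingText(String html, String suffixText) {
        // Whitespace is ignored on both sides of the comparison.
        String target = suffixText.replaceAll("\\s+", "");
        if (target.isEmpty()) {
            return html;
        }
        int i = html.length() - 1;    // scan position in html, moving backwards
        int j = target.length() - 1;  // next character of target still to match
        while (i >= 0 && j >= 0) {
            char c = html.charAt(i);
            if (c == '>') {
                // Skip the whole tag that ends here.
                int open = html.lastIndexOf('<', i);
                if (open < 0) {
                    return html;      // stray '>', give up
                }
                i = open - 1;
            } else if (Character.isWhitespace(c)) {
                i--;
            } else if (c == target.charAt(j)) {
                i--;
                j--;
            } else {
                return html;          // text does not end with suffixText
            }
        }
        if (j >= 0) {
            return html;              // html ran out before the suffix was matched
        }
        // i + 1 is the index of the first matched text character.
        return html.substring(0, i + 1).trim();
    }

    public static void main(String[] args) {
        String string1 = "<img src=\"a.jpg\" /> Have you seen <a href=\"x\">Between Two Ferns</a>?";
        String result1 = "Have you seen <a style=\"display:inline\" href=\"x\">Between Two Ferns</a>?";
        // Prints: <img src="a.jpg" />
        System.out.println(removeTrailingText(string1, visibleText(result1)));
    }
}
```

In the snippet above, the text of result1 could equally come from Jsoup.parse(result1).text() instead of the crude visibleText helper.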

Hope this solves.

That seems awfully specific to those two strings, and not at all extensible. Also, that still won't return true because string1 still doesn't end with result1. — Rudi Kershaw, May 27 '14 at 15:34

Rudi Kershaw · Answer 2 · 2014-05-30T09:11:53.270

Try not to deal with HTML in raw Strings (ever, if you can help it). The Jsoup API should be more than capable of dealing with what you need. From what you've said in the comments I took it that you are trying to roughly achieve the following.

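    org.jsoup.nodes.Document s1 = Jsoup.parse(string1);
    org.jsoup.nodes.Document r1 = Jsoup.parse(result1);

    // Select the first link in each document; first() returns null if there is none.
    org.jsoup.nodes.Element str = s1.select("a").first();
    org.jsoup.nodes.Element res = r1.select("a").first();

    if (str != null && res != null && str.text().equals(res.text())) {
        str.remove(); // detaches the link from the document held in s1
    }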

(Not tested, I don't have the Jsoup library plugged in on this computer)

This finds the link Elements in the HTML in your strings and stores them in str and res respectively. If the contents (text) of str are the same as those of res, it removes str from its parent Element, i.e. from the document you originally stored in s1.

If that is not enough, you could also get the values of the href attributes and compare those as well.

I hope this helps. Let me know how you get on.

@girl24 - Have a quick read of [Parsing Html The Cthulhu Way](http://blog.codinghorror.com/parsing-html-the-cthulhu-way/), which details why trying to use regex to parse HTML is a bad idea. Also, have a read of [this answer.](http://stackoverflow.com/a/1732454/2182928) — Rudi Kershaw, May 27 '14 at 16:50
