5

This is the relevant part of the code:

java.util.List<Element> elems = src.getAllElements();
Iterator it = elems.iterator();
Element el;
String key,value,date="",place="";
String [] data;
int k=0;
Segment content;
String contentstr;
String classname;

while(it.hasNext()){
    el = (Element)it.next();

    if(el.getName().equals("span"))
    {
        classname=el.getAttributeValue("class");
        if(classname.equals("edit_body"))
        {
            //java.util.List<Element> elemsinner = el.getChildElements();
            //Iterator itinner = elemsinner.iterator();

            content=el.getContent();
            contentstr=content.toString();

            if(true)
            {
                System.out.println("Done!");
                System.out.println(classname);
                System.out.println(contentstr);
            }
        }
    }
}

There is no output. But if I remove the if(classname.equals("edit_body")) condition, it does print (in one of the iterations):

Done!
edit_body
&quot;I honestly think it is better to be a failure at something you love than to be a success at something you hate.&quot;

I can't figure out where the bug is. Please help!

BTW, I am using an external Java library (Jericho) for HTML parsing.

BTW, there are two errors at the start of the output, which appear in both cases, with or without the if condition:

Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: EndTag br at (r1992,c60,p94048) not recognised as type '/normal' because its name and closing delimiter are separated by characters other than white space 

Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: Encountered possible EndTag at (r1992,c60,p94048) whose content does not match a registered EndTagType 

I hope those aren't what is causing the problem.

OK guys, can somebody please explain this to me? "edit_body".equals(el.getAttributeValue("class")) worked!!

arkanath
  • 3
    Do a `System.out.println(el.getName())` – Raekye Dec 20 '12 at 06:11
  • It's coming out as span, which is what it should be – arkanath Dec 20 '12 at 06:13
  • Your code is lacking key parts, so we can't even start helping. Does src.getAllElements(); actually output anything? What is the API for Element#getName? Assuming that equals on String doesn't work is just wrong; do you really think Java would still be alive if the equals method on String were not working? In general, when someone thinks the Java API is broken, 99.999% of the time it's not Java but their own code. – Peter Dec 20 '12 at 07:04
  • Well, of course `src.getAllElements` is giving output, as the iterator works perfectly if I remove the given `if` condition. getName() gives you the name of the tag as a string: http://jericho.htmlparser.net/docs/javadoc/index.html And obviously I don't think that the equals method is faulty or that the Java API is broken, otherwise I wouldn't have asked for your help. The title of the question is the closest phrase I could come up with! – arkanath Dec 20 '12 at 07:15
  • I've never had a problem with this but try converting both strings to the same charset? http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#getBytes(java.lang.String) Then compare array of bytes. – Raekye Dec 20 '12 at 07:16
  • OK guys!!!!! Can somebody please explain this to me? `"edit_body".equals(el.getAttributeValue("class"))` worked!! *BAZINGA!!* – arkanath Dec 20 '12 at 08:14
  • If your string contains spaces between words, I suggest you use the compareTo() method, as it compares character by character – Dilini Peiris Aug 02 '19 at 12:14

5 Answers

17

I just had exactly the same problem.

I managed to solve it by using: SomeStringVar.replaceAll("\\P{Print}","");.

This call removes every character that is not printable ASCII, which includes invisible Unicode characters (characters that you can't see, so the strings look equal even though they are not).

I applied it to each string involved in the comparison, and it worked for me as well.
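A minimal sketch of that cleanup, assuming the stray character is a zero-width space (U+200B). The thread never identifies the actual character, and the class and method names are made up for illustration.

```java
class StripNonPrintable {
    // Removes every character outside printable ASCII.
    // In Java's regex engine \p{Print} is ASCII-only unless
    // UNICODE_CHARACTER_CLASS is enabled.
    static String clean(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // Hypothetical value: looks like "edit_body" but ends with U+200B.
        String fromPage = "edit_body\u200B";

        System.out.println(fromPage.equals("edit_body"));        // false
        System.out.println(clean(fromPage).equals("edit_body")); // true
    }
}
```

Because the pattern is ASCII-only, it also strips legitimate non-ASCII letters (for example the é in "café"), so it suits identifiers such as class names better than display text.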

ldoroni
12

It looks like you have leading or trailing whitespace in your classname.

Try using this:

if(classname.trim().equals("edit_body"))

This will trim any whitespace at either end.
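A small sketch of what `trim()` does and does not cover. The input values are hypothetical and the class name is only for illustration.

```java
class TrimCompare {
    // Compares after removing leading/trailing characters <= U+0020.
    static boolean matches(String classname) {
        return classname.trim().equals("edit_body");
    }

    public static void main(String[] args) {
        // Ordinary space and newline are removed by trim().
        System.out.println(matches(" edit_body\n"));      // true

        // A no-break space (U+00A0) is above U+0020, so trim() keeps it.
        System.out.println(matches("\u00A0edit_body"));   // false
    }
}
```

So if `trim()` does not help, the extra character may simply be one that `trim()` does not treat as whitespace.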

Rohit Jain
  • @Arkanath.. Are you sure? Try replacing your `System.out.println(classname);` with: - `System.out.println("*" + classname + "*");` and see what you get. Of course first remove that `if`. – Rohit Jain Dec 20 '12 at 06:18
  • Unfortunately, it's coming out as *edit_body*! The edit_body text got italicized due to the *s – arkanath Dec 20 '12 at 06:20
  • @Arkanath.. Ah! Sorry. `*` made it italic I think. Replace `*` with `-`, and see whether the output you get is : - `- edit_body -`, with spaces or `-edit_body-` without spaces. – Rohit Jain Dec 20 '12 at 06:22
  • @Arkanath.. Well, now that's something strange. Please show us the HTML part that you are trying to parse. Only the part that contains that text. – Rohit Jain Dec 20 '12 at 06:25
  • BTW there are two errors at the start of the output, which is there in both the cases, with or without if condition.: `Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: EndTag br at (r1992,c60,p94048) not recognised as type '/normal' because its name and closing delimiter are separated by characters other than white space Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: Encountered possible EndTag at (r1992,c60,p94048) whose content does not match a registered EndTagType` Hope that wont cause the error – arkanath Dec 20 '12 at 06:26
  • @Arkanath.. Maybe you have some other unprintable Unicode characters that are preventing your string from matching exactly. You can see this post on how to remove unprintable characters: http://stackoverflow.com/questions/6772221/what-is-the-better-approach-to-trim-unprintable-characters-from-a-string This is the only possible reason I can see for this behaviour for now. – Rohit Jain Dec 20 '12 at 06:39
  • @Arkanath. You can also do: - `classname.replaceAll("\\p{C}", "");` before comparison. – Rohit Jain Dec 20 '12 at 06:40
  • @Downvoter.. Please Care to Comment, if you are downvoting any answer. – Rohit Jain Dec 20 '12 at 06:45
  • I tried both `classname.replaceAll("[^\\x20-\\x7e]", "")` and `classname.replaceAll("\\p{C}", "")`; neither helped. Really freaking out :) Very strange! – arkanath Dec 20 '12 at 06:49
  • @Arkanath.. I'm afraid, man, it seems like the error is really somewhere else, and we can't solve it by looking at your current code. You need to look at your HTML closely. Probably validate it in some way, because that is the only thing that might be causing the problem here. – Rohit Jain Dec 20 '12 at 06:51
  • Yeah, that is what i am doing.. BTW link to the html is http://www.great-quotes.com/quotes/category/Funny/pg/1.. If you need! – arkanath Dec 20 '12 at 06:55
  • @Arkanath.. That is a website? You mean the problem is with it's source? – Rohit Jain Dec 20 '12 at 06:56
  • As you know, I don't know where the problem is! I am just ruling out every possible mistake. – arkanath Dec 20 '12 at 07:02
  • this is how i defined the source: `String url_str = "http://www.great-quotes.com/quotes/category/Funny/pg/1"; System.out.println(url_str); URL url = new URL(url_str); Source src = new Source(url);` – arkanath Dec 20 '12 at 07:04
  • @Arkanath.. What?? Wow, that is weird. I mean, this is really like breaking the rules. `s1.equals(s2)` should give the same result as `s2.equals(s1)`, provided of course that neither of them is `null`. This is part of the contract of the `equals` method. I'm now even more surprised as to what the real problem is. – Rohit Jain Dec 20 '12 at 08:28
  • Whatever the problem is, it's a great relief, if you understand what I mean! I've been at this since last night! – arkanath Dec 20 '12 at 08:29
  • @Arkanath.. Hmm. Well that is a positive point though. But, just to confirm, can you try: - `el.getAttributeValue("class").equals("edit_body")` and see if it works? I mean, it should work. It must work. There is no 2nd behaviour for this. – Rohit Jain Dec 20 '12 at 08:31
  • No dude, it's not working. It seems like some encoding problem – arkanath Dec 20 '12 at 08:41
  • @Arkanath.. Ah! Leave it. Just assume that you never faced this behaviour, because this is unexpected. I don't know why this is happening. Anyway, you got your job done. That's it. – Rohit Jain Dec 20 '12 at 08:44
  • hehe... yes i am leaving it for now.. but will look into this later.. interesting situation! – arkanath Dec 20 '12 at 08:47
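
On the `null` caveat raised in the comments above: one possible explanation, which is an assumption the thread never confirms, is that `getAttributeValue("class")` returns `null` for a `span` that has no `class` attribute. In that case the two call orders really do behave differently. The sketch below uses plain strings rather than the parser, and its class and method names are hypothetical.

```java
import java.util.Objects;

class NullSafeEquals {
    // Literal first: String.equals(null) simply returns false.
    static boolean literalFirst(String classname) {
        return "edit_body".equals(classname);
    }

    // Variable first: throws NullPointerException when classname is null.
    static boolean variableFirst(String classname) {
        return classname.equals("edit_body");
    }

    public static void main(String[] args) {
        // Assumption: stands in for a span without a class attribute.
        String missing = null;

        System.out.println(literalFirst(missing));                  // false
        System.out.println(Objects.equals(missing, "edit_body"));   // false

        try {
            variableFirst(missing);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException");
        }
    }
}
```

If that assumption holds, an exception thrown on an earlier `span` would end the loop before the matching element is reached, which would fit the "no output" symptom. Checking the console for a stack trace would confirm or rule this out.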
2

Firstly, String.equals() is NOT broken. It works for millions of other programs / programmers. This is NOT the cause of your problems (unless you or someone has deliberately modified ... and broken your Java installation ...)

So why can two apparently equal strings compare as unequal?

  1. There could be leading or trailing whitespace characters on the String.
  2. There could be embedded non-printing characters.
  3. There could be pairs of Unicode characters that look the same when you display them with a typical font, but are in fact not the same. For instance, the Greek code page contains characters that look like Latin vowels ... but are in fact different codes, and hence are not equal.
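
To tell these three cases apart, it can help to print the code points of the suspect string instead of the string itself. The following is a rough sketch with a hypothetical helper name, using a Greek omicron (U+03BF) as an illustrative lookalike; it is not taken from the asker's data.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Renders each code point as U+XXXX so invisible or lookalike
    // characters become visible.
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String latin = "edit_body";
        // Illustrative lookalike: the "o" is a Greek small omicron.
        String lookalike = "edit_b\u03BFdy";

        System.out.println(latin.equals(lookalike)); // false
        System.out.println(dump(latin));
        System.out.println(dump(lookalike));
    }
}
```

Comparing the two dumps side by side shows exactly which position holds the unexpected character.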
Stephen C
  • 2
    Did I say String.equals() is broken?? I said it's not working, i.e. not working for me!! Anyway, explain this to me: `"edit_body".equals(el.getAttributeValue("class"))` worked! – arkanath Dec 20 '12 at 08:19
  • It sounds like what would happen in my scenario #3. Specifically, the version of your code that doesn't work has one of those "looks-like-a-Latin-letter-but-isn't" characters embedded in the source code. Or maybe it is in the web page (though that seems unlikely given you got the new version of your code to work.) – Stephen C Dec 20 '12 at 08:26
  • Not working == broken. Not working for me != not working. There is no doubt in my mind that the String.equals method is working exactly as specified, and that the result you are getting is exactly what the specification says it should be. The problem is in the way that you are using it. You just need to be *forensic* in the way that you debug the problem. – Stephen C Dec 20 '12 at 08:32
  • Yes, that's the point: what happened in the new version?? I mean, it's surprising to me! – arkanath Dec 20 '12 at 08:35
  • Please read what I've written ... I've explained (twice now!!) what appears to be the root cause. I've no idea how you managed to get the "funky" character into your source code ... – Stephen C Dec 20 '12 at 08:36
  • I have understood your reasoning... I just want to know whether there is a difference between `a.equals("b")` and `"b".equals(a)`. Does `"b".equals(a)` remove the "funky" characters from a? – arkanath Dec 20 '12 at 08:45
  • Nope. The problem is that one of the string literals contains a funky character and the other doesn't. You should be able to spot it if you look at the source file using some utility that will give you a hex dump of the characters. – Stephen C Dec 20 '12 at 09:56
0

change the code to:

classname="edit_body"; //<- hardcode 

if(classname.equals("edit_body"))

If the code enters the if statement now, then there must be some difference in the string content when you use the original "classname=el.getAttributeValue("class");". In that case, loop over the individual characters and compare them to find the difference.

If the code still doesn't enter the if statement, either your code is not compiling and you are running old code, or your Java installation is broken ;-)

OR.

If Java is anything like .NET (I don't know Java): is "el.getAttributeValue" typed as string? If it is typed as object, then the if statement would not be entered, since those are two different instances of the same string.
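
The character-by-character comparison suggested above could look roughly like this. The helper name and the sample value (a trailing no-break space) are assumptions for illustration only.

```java
class FirstDifference {
    // Returns the index of the first differing char, or -1 if the
    // strings are identical. If one string is a prefix of the other,
    // returns the length of the shorter one.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        String expected = "edit_body";
        // Hypothetical parsed value with a trailing U+00A0.
        String actual = "edit_body\u00A0";

        int i = firstDifference(expected, actual);
        System.out.println(i); // 9
        if (i >= 0 && i < actual.length()) {
            // Print the offending character as a hex code.
            System.out.println(Integer.toHexString(actual.charAt(i))); // a0
        }
    }
}
```

Printing the length of both strings alongside the index is usually enough to show whether the difference is an extra character or a substituted one.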

Roger Johansson
0

equals() is a method of the String class, so its argument must be a string written in double quotes. Single quotes denote a char literal in Java.

 if(someString.equals("something")) ✓
 if(someString.equals('something')) ×
Florian Gössele
ZBorkala