
I've come across a problem that seems really weird to me.

I'm scraping a website using Jsoup:

Elements names = doc.select(".Mod.Thm-inherit").select("h3");

for (Element e : names) {
    System.out.println(e.text());
}

My output is (Fantasy hockey team names, names changed for simplicity):

Team One ?
Team Two ?
Team Three ?
Team Four ?
Team Five ? 
//etc

Now the actual team names don't have the extra space or question mark. Thinking I could just replace it, I tried:

String str = e.text().replaceAll("\\?", "");
System.out.println(str);

This however still outputs the question mark at the end. I'm thinking that this might mean that it's a character that Eclipse/Java doesn't recognize. (Note: It doesn't display a �, it's really just the generic ?)
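
One possible explanation, sketched here under an assumption: when the console's encoding cannot represent a character, Java substitutes a literal `?` on output, while the string in memory still holds the original character. That would explain why replacing `?` has no effect. The sample string and the use of US-ASCII as a stand-in for the console encoding are illustrative assumptions, not taken from the page being scraped.

```java
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {
    // Hypothetical sample: a team name followed by a space and a character
    // that US-ASCII cannot represent.
    static final String SAMPLE = "Team One \uE037";

    // Encodes to US-ASCII and decodes back, mimicking a console that cannot
    // display the character.
    static String asConsoleWouldShow(String s) {
        // String.getBytes(Charset) replaces unmappable characters with '?'.
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(asConsoleWouldShow(SAMPLE));
        // The in-memory string contains no real '?', so this replacement
        // leaves it unchanged.
        System.out.println(SAMPLE.replaceAll("\\?", "").equals(SAMPLE));
    }
}
```

If this is what is going on, the `?` exists only in the printed output, so any fix has to target the real character rather than `?`.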

When looking at the HTML code, there are no extra characters though:

<script charset="utf-8" type="text/javascript" language="javascript">
<!-- Bunch of HTML -->
<div class="Grid-u-1-2 Pend-xl"><h3 class="My-xl Ta-c Fz-lg"><a href="/hockey/27381/1">Team One</a>

Anyone know why this is happening?

Edit: I was quickly able to solve the issue by just doing a substring and removing the last 2 characters, but I'd still like to know why it's happening.

Edit2: Playing around with it more, I found that if I (int) cast the question mark, it gives me 57399, instead of ?'s regular 63. So definitely some sort of unknown character issue. Just not sure why it's being added or what that character is supposed to represent.
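
For reference, 57399 is 0xE037 in hex, which falls in the Unicode Private Use Area (U+E000 to U+F8FF). Code points in that range have no standard meaning; pages sometimes use them for icon-font glyphs, though whether that is the case here is a guess. The sketch below shows one way to inspect the code points and strip private-use characters instead of cutting a fixed number of characters. The class name, method names, and sample string are hypothetical; in the scraper, `raw` would be `e.text()`.

```java
public class PrivateUseCleaner {
    // Lists each code point as U+XXXX so unprintable characters become visible.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Removes private-use characters (Unicode general category Co),
    // then trims the whitespace left behind.
    static String clean(String s) {
        return s.replaceAll("\\p{Co}", "").trim();
    }

    public static void main(String[] args) {
        String raw = "Team One \uE037"; // hypothetical stand-in for e.text()
        System.out.println(describe(raw));
        System.out.println(clean(raw));
    }
}
```

Note that `trim()` only removes characters up to U+0020. If the extra space turns out to be a non-breaking space (U+00A0), `describe` will reveal it and it would need to be removed separately.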

Tiberiu
  • Not enough information/context. You need to reproduce the problem with a minimal HTML/CSS and then provide that so someone can reproduce the problem. – Jim Garrison Oct 30 '14 at 03:42
  • I'd link to the page itself but it's only accessible after a login. Also I'd gladly produce a minimal HTML but I wouldn't even know where to start. I don't have much HTML knowledge and this particular website has a pretty complex structure. Short of providing the entire source HTML, there isn't much I can do – Tiberiu Oct 30 '14 at 04:16
  • I was hoping this was known behaviour that someone with more experience would be able to instantly identify – Tiberiu Oct 30 '14 at 04:17
  • Part of the problem is that with so little code in your post it's hard to tell what's going on. For example, the ` – Jim Garrison Oct 30 '14 at 05:19

1 Answer


I think there must be extra h3 elements containing strange characters inside your ".Mod.Thm-inherit" element.

For a complete solution you must provide more information as @Jim Garrison said.

The following code:

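    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Parse the snippet from the question and print the text of every h3.
    String html = "<div class=\"Grid-u-1-2 Pend-xl\"><h3 class=\"My-xl Ta-c Fz-lg\"><a href=\"/hockey/27381/1\">Team One</a>";
    Document doc = Jsoup.parse(html);
    Elements names = doc.select("h3");
    for (Element e : names) {
        System.out.println(e.text());
    }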

This gives me the expected output, Team One, with no strange characters at all.

Hope it helps. Best regards.

fonkap