import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;


public class Main {
    public static void main(String[] args) throws Exception {
        Document d=Jsoup.connect("https://osu.ppy.sh/u/charless").get();

        for(Element line : d.select("div.profileStatLine")) {
            // Select within the current stat line, not the whole document.
            System.out.println(line.select("b").text());
        }
    }
}

I'm having trouble getting the text "2027pp (#97,094)" from the b element inside div.profileStatLine. I expect the code above to print it, but it doesn't. URL: https://osu.ppy.sh/u/charless

1 Answer

Parts of the page are loaded with JavaScript. Jsoup does not execute scripts, so the HTML it downloads doesn't contain the divs you're looking for.

You can use a browser to load the page and run the JavaScript before parsing. A library like WebDriverManager will help with setting up the browser driver.

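public static void main(String[] args) throws Exception {
    // Set up the chromedriver binary.
    ChromeDriverManager.getInstance().setup();
    ChromeDriver chromeDriver = new ChromeDriver();
    Document d;
    try {
        // Load the page in a real browser so its JavaScript runs.
        chromeDriver.get("https://osu.ppy.sh/u/charless");
        // Hand the browser's HTML to Jsoup.
        d = Jsoup.parse(chromeDriver.getPageSource());
    } finally {
        // quit() ends the whole session, even if loading the page fails.
        chromeDriver.quit();
    }

    for (Element line : d.select("div.profileStatLine")) {
        System.out.println(line.select("b").text());
    }
}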

The alternative is to examine the JavaScript in the page and make the same requests it makes to retrieve the data.

The page loads the profile from https://osu.ppy.sh/pages/include/profile-general.php?u=4084042&m=0. It looks like u is simply the user ID, which is relatively simple to extract from the page:

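import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProfileScraper {
    // Matches the user ID assigned in one of the page's inline scripts.
    private static final Pattern UID_PATTERN = Pattern.compile("var userId = (\\d+);");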

    public static void main(String[] args) throws IOException {
        String uid = getUid("charless");
        Document d = Jsoup.connect("https://osu.ppy.sh/pages/include/profile-general.php?u=" + uid).get();

        for (Element line : d.select("div.profileStatLine")) {
            System.out.println(line.select("b").text());
        }
    }

    public static String getUid(String name) throws IOException {
        Document d1 = Jsoup.connect("https://osu.ppy.sh/u/" + name).get();

        for (Element script : d1.select("script")) {
            String text = script.data();
            Matcher uidMatcher = UID_PATTERN.matcher(text);
            if (uidMatcher.find()) {
                return uidMatcher.group(1);
            }
        }
        throw new IOException("No user ID found for " + name);
    }
}
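
To check the pattern without touching the network, the extraction step can be exercised on an inline string. This is only a minimal sketch: the class name UidPatternDemo, the helper extractUid, and the sample script text are made up for illustration, and the real page's script may be formatted differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UidPatternDemo {
    // Same pattern as in ProfileScraper: captures the digits in "var userId = 123;".
    private static final Pattern UID_PATTERN = Pattern.compile("var userId = (\\d+);");

    // Returns the captured user ID, or an empty Optional if the text has no match.
    static Optional<String> extractUid(String scriptText) {
        Matcher m = UID_PATTERN.matcher(scriptText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical script content, not copied from the real page.
        String sample = "var userId = 4084042;\nvar somethingElse = 0;";
        System.out.println(extractUid(sample).orElse("no match"));
        System.out.println(extractUid("var userId = null;").orElse("no match"));
    }
}
```

Returning an Optional instead of throwing keeps the "not found" case explicit, so the caller decides whether a missing ID is an error.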
teppic
  • So I ran this, and it worked. The only problem is that the program I'm making is a bot that will eventually be put on a server, so I'm wondering if this method will still work in that environment. – Charles Baldwin Sep 12 '17 at 01:09
  • Selenium will run headless without any trouble, but it will be fairly heavy for what you're doing. Your other option is to reverse engineer the page and discover what calls the JavaScript makes to load the content you're after. – teppic Sep 12 '17 at 01:16
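
The second option in the last comment amounts to requesting the include URL from the answer directly. Here is a minimal sketch of building that URL, assuming the u and m query parameters seen in the answer's example. The answer does not say what m means, so it is passed through uninterpreted, and ProfileUrlDemo and profileGeneralUrl are hypothetical names.

```java
final class ProfileUrlDemo {
    // Endpoint taken from the answer, which reports that the page loads the profile from it.
    private static final String BASE = "https://osu.ppy.sh/pages/include/profile-general.php";

    // userId is the numeric ID extracted from the profile page.
    // m is copied from the observed URL; its meaning is not stated in the answer.
    static String profileGeneralUrl(long userId, int m) {
        return BASE + "?u=" + userId + "&m=" + m;
    }

    public static void main(String[] args) {
        System.out.println(profileGeneralUrl(4084042L, 0));
    }
}
```

The resulting string can be passed to Jsoup.connect(...) in place of the hand-built URL in ProfileScraper.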