0

How can I get, as a String, the text that a browser displays on a tab when a website is opened? For example, if I open http://www.stackoverflow.com, is it possible to extract the string "Stack Overflow", as shown here:

[Image: Stack Overflow tab]

I'm interested in a Java implementation - java.net.URL doesn't seem to have a method for that.

shooqie
  • http://stackoverflow.com/questions/24237036/how-to-get-name-of-website-from-any-string-url also maybe http://stackoverflow.com/questions/5919476/how-to-take-title-text-from-any-web-page-in-java also maybe http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/ – user1274820 Sep 08 '15 at 18:00

2 Answers

4

I'm interested in a Java implementation - java.net.URL doesn't seem to have a method for that.

No, java.net.URL won't do it; you need an HTML parser like JSoup. Then you just take the content of the title tag in the head.

E.g., assuming you have a URL:

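import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect(url).get(); // url is the page address as a String
Element titleElement = doc.select("head title").first(); // Or just "title", it's always supposed to be in the head
String title = titleElement == null ? null : titleElement.text();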
T.J. Crowder
  • 1
    You don't "need" an HTML parser. It's easier to use one, but it is not needed. It always frustrates me when the answer for something is "implement an entire suite of tools to handle one simple task" – user1274820 Sep 08 '15 at 18:03
  • Good answer, you could improve it with an example. – MirMasej Sep 08 '15 at 18:04
  • 2
    @user1274820: You *do* need an HTML parser. You [cannot reliably parse HTML without one](http://stackoverflow.com/a/1732454/157247). Now, it can be a very simple one targeted only at extracting this information, or a powerful general-purpose one like JSoup. But you need a parser of some kind. – T.J. Crowder Sep 08 '15 at 18:06
  • Okay, any code you write to perform this task is technically a parser - what I mean is that you do not need an entire HTML parsing library to get one piece of information. – user1274820 Sep 08 '15 at 18:10
  • 2
    @user1274820: I've tried to go down that path. "Oh, I don't really need a parser for this, I can just..." It's a massive waste of time. Barring a **really** good reason (like, embedded systems and not having room), it's better to just use the right tool for the job. – T.J. Crowder Sep 08 '15 at 18:11
  • @user1274820: See my comment on it for why we parse. :-) But hey, yes, that kind of thing can be a 97% solution, no question. – T.J. Crowder Sep 08 '15 at 18:21
  • Thank you T.J. That was my only point - it is not impossible to do with the standard library. I dislike seeing someone say that it is not possible - easier using a suite, yes, but I prefer code that I can tweak as needed to software libraries that rely on their creators to update and bugfix. – user1274820 Sep 08 '15 at 18:28
0

Look for the following pattern in the response:

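// Non-greedy group so the match stops at the first </title>; [^>]* tolerates attributes on the opening tag
private static final Pattern TITLE_TAG = Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);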
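
As a rough sketch of how such a pattern could be applied once the response body is available as a String: the class name `RegexTitleSketch` and the method `extractTitle` are made up for illustration, and fetching the page is assumed to have happened elsewhere. The pattern in this sketch is non-greedy and tolerates attributes on the opening tag, so it stops at the first closing tag. It does not decode HTML entities or skip commented-out markup, which is part of why this is only a "mostly works" approach.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTitleSketch {
    // Non-greedy group so the match stops at the first closing tag;
    // [^>]* tolerates attributes on the opening tag.
    private static final Pattern TITLE_TAG = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text of the first title element, or empty if none is found. */
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE_TAG.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Collapse runs of whitespace (including line breaks) into single spaces.
        String title = m.group(1).replaceAll("\\s+", " ").trim();
        return Optional.of(title);
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n  Stack Overflow\n</TITLE></head><body></body></html>";
        System.out.println(extractTitle(html).orElse("(no title)")); // prints "Stack Overflow"
    }
}
```

Returning an `Optional` keeps the "page has no title" case explicit instead of handing back null.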

Another solution, since parsing HTML with regex is not considered good practice:

javax.swing.text.html.HTMLDocument

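URL url = new URL("http://yourwebsitehere.com");
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);

HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
// Avoid a ChangedCharSetException when the page declares its charset in a <meta> tag
htmlDoc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
// Parse the stream into the document; without this step the title property is never set
htmlKit.read(br, htmlDoc, 0);
String title = (String) htmlDoc.getProperty(HTMLDocument.TitleProperty);
System.out.println("HTMLDocument Title: " + title);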
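
For trying this approach without a network connection, here is a minimal sketch that parses HTML held in a String. The class name `SwingTitleSketch` and the method `extractTitle` are hypothetical, and wrapping the checked exceptions in an `IllegalStateException` is just one possible choice.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

final class SwingTitleSketch {
    /** Parses the given HTML with Swing's HTMLEditorKit and returns the document title property. */
    static String extractTitle(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Ignore <meta> charset declarations; the input is already decoded text.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            // Reading populates the document, including its title property.
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not parse HTML", e);
        }
        return (String) doc.getProperty(HTMLDocument.TitleProperty);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Stack Overflow</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println("HTMLDocument Title: " + extractTitle(html));
    }
}
```

The same method can be fed the text of a downloaded page; only the source of the `Reader` changes.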
Raman Shrivastava