2

First, let me explain where I'm coming from. I have a string containing the HTML of a web page, which I fetched using jsoup. The HTML is all in the string, and I can print it to a text file. I'm trying to extract song titles from this HTML, and every song is wrapped in the same tags.

This is a line from the text file I printed it to:

          <div class="title" itemprop="name">
           Wrath
          </div> </td> 

In Notepad it looks like a single line, but when you copy and paste it, it looks like the above. What I want is the "Wrath" in the middle, so I tried to write a pattern to find it, with help from this other Stack Overflow post: Java regex to extract text between tags

This is the relevant part of my code:

Pattern p = Pattern.compile("<div class=\"title\" itemprop=\"name\">(.+?)</div> </td>");
Matcher m = p.matcher(html);
while (m.find()) {
    quote.add(m.group(1));
}

When it runs, the ArrayList quote ends up empty. This might not be working because the pattern has to account for the whitespace and line breaks in between. Any ideas?

Kasarrah
  • Try using [XPath instead.](http://docs.oracle.com/javase/7/docs/api/javax/xml/xpath/package-summary.html) –  Jun 22 '15 at 01:08

2 Answers

4

You can use jsoup to parse as well as download your HTML document:

String site = "http://example.com/";
Document doc = Jsoup.connect(site).get();
String text = doc.select("div.title").first().text();

Or just use XPath if that doesn't work. Regular expressions are great for picking out data from unstructured text. When you have a structured document like HTML, however, you can leave all of the heavy lifting to a purpose-built parser. Java ships with the javax.xml.xpath library, with which you can search the node tree of your document.

Let's say your document looks like this:

<html>
  <body>
    <div class="title">Wrath</div>
  </body>
</html>

You could do this to find the text in that div:

XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "/html/body/div[@class='title']/text()";
InputSource inputSource = new InputSource("myDocument.html");
NodeList nodes = (NodeList) xpath.evaluate(expression, inputSource, XPathConstants.NODESET);
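
Note that the `InputSource(String)` constructor treats its argument as a system identifier (a URI), not as the document text. If the markup is already held in a `String`, a minimal sketch like the following wraps it in a `StringReader` instead. The class name `XPathTitles` and the method `titles` are made up for illustration, and the approach assumes the markup is well-formed XML, which real-world HTML often is not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class XPathTitles {

    // Returns the trimmed text of every <div class="title"> directly under /html/body.
    // Assumes the input is well-formed XML; malformed HTML will make evaluate() fail.
    static List<String> titles(String markup) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expression = "/html/body/div[@class='title']/text()";
        // Wrap the text in a reader so it is parsed as content, not resolved as a URI.
        InputSource source = new InputSource(new StringReader(markup));
        try {
            NodeList nodes = (NodeList) xpath.evaluate(expression, source, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the given markup", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><div class=\"title\">Wrath</div></body></html>";
        System.out.println(titles(markup)); // prints [Wrath]
    }
}
```

For pages that are not valid XML, a lenient HTML parser such as jsoup is likely the more practical route.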
  • I can't use the first part because I can't know that Wrath is already in there; it can be any name. As for the XPath code that you gave me, I'm getting errors such as `MalformedURLException: no protocol:` and then it shows the file afterwards – Kasarrah Jun 22 '15 at 01:33
  • 2
    I presume that you are downloading a webpage from somewhere? In that case, you can parse the `String` you downloaded into a jsoup `Document`, and then just use `doc.select("div.title").text()` to get the text in question. –  Jun 22 '15 at 01:38
  • Ah!! That worked for the most part, it got all of the songs and just a little extra stuff. Thank you so much! – Kasarrah Jun 22 '15 at 01:43
  • Not a problem. I'm glad you've found a solution! Also, take a closer look at the [`jsoup` documentation](http://jsoup.org/cookbook/). Its selectors use a CSS-style syntax, and it is very powerful. –  Jun 22 '15 at 01:45
0

If it parses like Perl, you may have to double up on the backslash:

Pattern p = Pattern.compile("<div class=\"title\" itemprop=\"name\">(.*?)<\\/div>");

Should be:

Pattern p = Pattern.compile("<div class=\"title\" itemprop=\"name\">(.*?)<\\\\/div>");

But for this kind of thing, a regex is the wrong tool.
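
If a regex has to be used anyway, the line breaks in the question's sample are a likely culprit: by default `.` does not match line terminators, so `(.+?)` cannot span from the opening tag to the title on the next line. In Java a forward slash also needs no escaping in a regex. The following is only a sketch under those assumptions; the class name `TitleRegex` and the method `extractTitles` are invented for illustration, and it still breaks as soon as the attributes change order or spacing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class TitleRegex {

    // \s* absorbs the whitespace and line breaks around the title text;
    // DOTALL additionally lets '.' match line terminators inside the title.
    private static final Pattern TITLE = Pattern.compile(
            "<div class=\"title\" itemprop=\"name\">\\s*(.+?)\\s*</div>",
            Pattern.DOTALL);

    // Collects the trimmed contents of every matching title div.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String html = "          <div class=\"title\" itemprop=\"name\">\n"
                + "           Wrath\n"
                + "          </div> </td> ";
        System.out.println(extractTitles(html)); // prints [Wrath]
    }
}
```

The trailing `</td>` from the original pattern is dropped here so that the match does not depend on what follows the closing `</div>`.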

JGNI