I need to extract some data from a website and then save some of the values in variables.

Here is the code:

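import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class Principal {

    public static void main(String[] args) throws IOException {

        URL url = new URL("http://www.numbeo.com/cost-of-living/country_result.jsp?country=Turkey");
        URLConnection yc = url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        String inputLine;
        String valor;
        String str = null;

        // Scan the page line by line for the row that mentions "Milk"
        while ((inputLine = in.readLine()) != null) {
            if (inputLine.contains("Milk")) {
                System.out.println("Encontrei! " + inputLine); // "Encontrei!" = "Found it!"
                // Skip the 13-character "priceValue"> marker plus the space after it
                valor = inputLine.substring(inputLine.lastIndexOf("\"priceValue\">") + 14);
                System.out.println("valor:" + valor);
            }
        }
        in.close();
    }
}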

The first println of inputLine prints this: <tr class="tr_standard"><td>Milk (regular), (1 liter) </td> <td style="text-align: right" class="priceValue"> 2.45&nbsp;TL</td>

Now I need to extract just the "2.45". How can I do that? I have already tried some regexes but can't make them work. Sorry for my English. Thanks in advance.

user3088049

2 Answers

You can try the following regex:

(?:class="priceValue">\s*)(\d*\.\d+)

It looks for the string class="priceValue"> followed by optional whitespace and a price; the price is captured in group 1.

Here is DEMO and explanation
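
To pull the captured price out in Java, something along these lines should work. This is only a sketch: the class name `PriceRegexDemo` and the helper `extractPrice` are made up for illustration, and the sample line is the one printed in the question. It uses `Matcher.find()` and reads group 1, because `String.replaceAll` substitutes the matched text instead of returning it, and `String.matches` only returns a boolean and requires the whole input to match.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer above.
class PriceRegexDemo {

    // Group 1 captures the number that follows class="priceValue"> and optional whitespace.
    private static final Pattern PRICE =
            Pattern.compile("(?:class=\"priceValue\">\\s*)(\\d*\\.\\d+)");

    /** Returns the captured price text, or an empty Optional if the line has no match. */
    static Optional<String> extractPrice(String line) {
        Matcher m = PRICE.matcher(line);
        // find() searches anywhere in the line; matches() would need the whole line to match.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // The line printed in the question.
        String inputLine = "<tr class=\"tr_standard\"><td>Milk (regular), (1 liter) </td> "
                + "<td style=\"text-align: right\" class=\"priceValue\"> 2.45&nbsp;TL</td>";
        System.out.println("valor:" + extractPrice(inputLine).orElse("not found")); // valor:2.45
    }
}
```

In the question's loop, `extractPrice(inputLine)` could be called directly on the line that contains "Milk", without the `substring` step.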

MaxZoom
  • Hi, thanks! I tried it like this: `str = valor.replaceAll("(?:class=\"priceValue\">\\s+)([\\d.]+)",""); System.out.println("valor:" +str);` But the println shows: valor:2.45 TL – user3088049 Nov 17 '15 at 22:23
  • You should use `matcher` – MaxZoom Nov 17 '15 at 22:25
  • Like this? `valor.matches("(?:class=\"priceValue\">\\s+)([\\d.]+)");` – user3088049 Nov 17 '15 at 22:27
  • I would discourage using regex for HTML parsing, even in such a simple case. It's usually much more complicated than it looks, and it is an anti-pattern. What happens if the `class` attribute is not the last one in an element? The HTML would still be valid, but this solution would not work - see [a test regex with this condition, and no matching result](https://regex101.com/r/oF4yX6/1). For a good (and humorous) reference, please see [this question about using regex to parse [X]HTML](http://stackoverflow.com/a/1732454/1663942). I would recommend @JockX's answer. – Juan Carlos Coto Nov 18 '15 at 02:53
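
The attribute-order concern in the last comment can be reproduced with a small sketch. The class name `AttributeOrderDemo`, the helper `findsPrice`, and the reordered `<td>` are made up for illustration; the pattern is the one from the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing how the regex depends on attribute order.
class AttributeOrderDemo {

    // The pattern expects class="priceValue" to be immediately followed by '>'.
    private static final Pattern PRICE =
            Pattern.compile("(?:class=\"priceValue\">\\s*)(\\d*\\.\\d+)");

    /** True if the pattern finds a price anywhere in the given HTML fragment. */
    static boolean findsPrice(String html) {
        return PRICE.matcher(html).find();
    }

    public static void main(String[] args) {
        // Attribute order as served in the question: class comes last.
        String original = "<td style=\"text-align: right\" class=\"priceValue\"> 2.45&nbsp;TL</td>";
        // Same cell with the attributes swapped; still valid HTML.
        String reordered = "<td class=\"priceValue\" style=\"text-align: right\"> 2.45&nbsp;TL</td>";

        System.out.println(findsPrice(original));  // true
        System.out.println(findsPrice(reordered)); // false
    }
}
```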

I know you are asking for a regex, but consider making your life easier by parsing the HTML as the structured document it is, rather than as a plain string. There are libraries that handle this for you, so you don't have to worry about text formatting, legal line breaks and other details. With Maven, for example, you can add jsoup as a dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.1</version>
</dependency>

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class HtmlParser {
    public static void main(String[] args) {

        Document doc;
        try {
            doc = Jsoup.connect("http://www.numbeo.com/cost-of-living/country_result.jsp?country=Turkey").get();
            Elements rows = doc.select("table.data_wide_table tr.tr_standard"); // CSS selector to find all table rows
            for (Element row : rows) {
                System.out.println("Item name: " + row.child(0).text()); // Milk will be here somewhere
                System.out.println("  Item price by column number: " + row.child(1).text());
                System.out.println("  Item price by column class:  " + row.getElementsByAttributeValue("class", "priceValue").get(0).text());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

/**
 Output:
 Item name: Meal, Inexpensive Restaurant
   Item price by column number: 15.00 TL
   Item price by column class: 15.00 TL
 Item name: McMeal at McDonalds (or Equivalent Combo Meal)
   Item price by column number: 15.00 TL
   Item price by column class: 15.00 TL
...
*/
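
The cell text still carries the currency suffix ("15.00 TL"), so one more step is needed to get just the number into a variable. The following is a minimal sketch using only the standard library; the class name `PriceTextParser` and the helper `parsePrice` are made up for illustration, and it assumes prices use a dot as the decimal separator with no thousands separators. It takes the first number in the text, so it does not matter whether the separator before "TL" is a regular or a non-breaking space.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for turning cell text such as "15.00 TL" into a number.
class PriceTextParser {

    // First run of digits with an optional decimal part, e.g. "15.00" in "15.00 TL".
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    /** Returns the first number found in the cell text, or empty if there is none. */
    static Optional<BigDecimal> parsePrice(String cellText) {
        Matcher m = NUMBER.matcher(cellText);
        return m.find() ? Optional.of(new BigDecimal(m.group())) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("15.00 TL").orElse(null));     // 15.00
        System.out.println(parsePrice("2.45\u00a0TL").orElse(null)); // 2.45
    }
}
```

Inside the loop above, the call would look like `parsePrice(row.child(1).text())`.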
JockX