Web scraping using regex in Java

Question

I'm trying to write a crawler that extracts the menu items from a site using regex in Java. The page URL is http://www.dinebombaygarden.com/appetizers.html

How can I get the menu items (Vegetable Pakora, Onion or Spinach or Potato Pakora ...) using Pattern and Matcher?

My code is as follows, but it is not working correctly.

public ArrayList<String> getMenuItems(String menuURL, String menuRegex) throws IOException{
    ArrayList<String> items = new ArrayList<String>();
    Document doc = Jsoup.connect(menuURL).post();
    String text = doc.body().text();
    System.out.println(text);
    Pattern pattern = Pattern.compile(menuRegex);
    Matcher matcher = pattern.matcher(text);
    while(matcher.find()){
        items.add(matcher.group());
    }
    return items;
}

String menuURL = "http://www.dinebombaygarden.com/appetizers.html";
String menuRegex = "[A-Z][a-z]+.{10,50}[$]\\s[\\d.]+.95";

The menuRegex here does not match the items correctly. Can anyone help with this issue?

Thank you very much.

Yes, don't use regexp to parse HTML (or XML). Use an HTML parser to do that. — Guillaume Polet, Apr 24 '12 at 14:11
Check out [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — Sergey Kalinichenko, Apr 24 '12 at 14:13
I think this is well-defined and simple enough to be properly handled by a regex. The knee-jerk reaction is not necessarily *always* the best. — mellamokb, Apr 24 '12 at 14:18

score 1 · Answer 1 · answered Apr 24 '12 at 14:15

You have a few issues with your regular expression:

[A-Z][a-z]+ applies the + only to the [a-z], and will not handle spaces properly (i.e., it will only match Pakora in Vegetable Pakora).
You need to escape . in .{10,50}, otherwise it's matching any character rather than the period specifically: \.{10,50}.
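
To make the second point concrete, here is a small sketch comparing the unescaped and escaped forms. The class name, helper method, and sample strings are made up for illustration and do not come from the question.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Returns true when the whole input matches the given regex.
    static boolean fullyMatches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String letters = "Vegetable Pakora"; // 16 characters, no periods
        String leader = "............";      // a run of periods

        // Unescaped dot: any 10 to 50 characters match, including letters.
        System.out.println(fullyMatches(".{10,50}", letters));   // true
        // Escaped dot: only a run of 10 to 50 literal periods matches.
        System.out.println(fullyMatches("\\.{10,50}", letters)); // false
        System.out.println(fullyMatches("\\.{10,50}", leader));  // true
    }
}
```

In a Java string literal the backslash itself has to be doubled, which is why the escaped form is written `"\\.{10,50}"`.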

Here's a regular expression that will match correctly, and capture the name of the food as well as the price in the capture groups:

\<h3\>([^.]+)\.{10,50}[$]\s([\d.]+.95)

It works by finding the <h3> tags, and then capturing all text before the first period as the name of the food. The rest is the same as your original regex, except I've added capturing around the price.
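
A minimal sketch of how this pattern could be used from Java, reading the name and price from the two capture groups. The class name, the `extract` helper, and the sample markup are assumptions for illustration; the real page may be structured differently. Because the pattern looks for `<h3>` tags, it would need to run against the page's markup (with jsoup, something like `doc.html()`) rather than the tag-free text returned by `doc.body().text()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MenuHtmlRegexDemo {
    // The pattern from this answer, with backslashes doubled for a Java string literal.
    static final Pattern ITEM =
            Pattern.compile("\\<h3\\>([^.]+)\\.{10,50}[$]\\s([\\d.]+.95)");

    // Returns one "name | price" string per match; group 1 is the name, group 2 the price.
    static List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            items.add(m.group(1).trim() + " | " + m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        // Hypothetical markup, assumed for this sketch only.
        String html = "<h3>Vegetable Pakora............$ 4.95</h3>\n"
                    + "<h3>Onion or Spinach or Potato Pakora............$ 4.95</h3>";
        for (String item : extract(html)) {
            System.out.println(item);
        }
    }
}
```

Note that `[^.]+` stops at the first period, so an item name that itself contains a period would not be captured whole.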

Demo: http://www.rubular.com/r/I7Hyk4cAI0

score 0 · Answer 2 · answered Apr 24 '12 at 14:25

You can use the Java API of Selenium to interact with web pages.

For example:

WebDriver driver = new FirefoxDriver();
driver.get("http://www.dinebombaygarden.com/appetizers.html");
List<WebElement> menuElements = driver.findElements(By.cssSelector("#content-center .left-data > h3"));
// now iterate through the elements and get the contents with .getText()

Also, I am the developer of Abmash, which could be an alternative. It lets you do the same job in a more visual way, without knowing anything about the page's source code. Example:

Browser browser = new Browser("http://www.dinebombaygarden.com/appetizers.html");
HtmlElements menuElements = browser.query(headline(), below(headline("appetizers"))).find();
// now iterate through the elements and get the contents with .getText()

More info on Selenium: http://seleniumhq.org/

More info on Abmash: https://github.com/alp82/abmash

score 0 · Answer 3 · answered Apr 24 '12 at 14:41

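Try jsoup (http://jsoup.org):

// Fetch the page and select the <h3> headings inside div.left-data
Document doc = Jsoup.connect("http://www.dinebombaygarden.com/appetizers.html").get();
Elements menuItems = doc.select("div.left-data h3");
// Iterate over menuItems and read each entry with .text()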

answered by Anton

score 0 · Answer 4 · answered Apr 24 '12 at 14:47

Not the best regex, but this will do the job:

String menuRegex = "['A-Za-z\\s]+\\.{10,50}[$][\\s]*[0-9]*\\.?[0-9]+";
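
As a rough sketch of how this pattern behaves on tag-free text such as the output of `doc.body().text()`, the example below runs it over a made-up sample string. The class name, the `extract` helper, and the sample text are assumptions, not taken from the actual page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MenuTextRegexDemo {
    // The pattern from this answer, unchanged.
    static final Pattern ITEM =
            Pattern.compile("['A-Za-z\\s]+\\.{10,50}[$][\\s]*[0-9]*\\.?[0-9]+");

    // Returns each full match ("name....$ price") with surrounding whitespace removed.
    static List<String> extract(String text) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(text);
        while (m.find()) {
            // The leading character class also accepts whitespace, so trim each match.
            items.add(m.group().trim());
        }
        return items;
    }

    public static void main(String[] args) {
        String leader = "............"; // run of periods between name and price
        // Hypothetical page text, assumed for this sketch only.
        String text = "Vegetable Pakora" + leader + "$ 4.95 "
                    + "Onion or Spinach or Potato Pakora" + leader + "$ 4.95";
        for (String item : extract(text)) {
            System.out.println(item);
        }
    }
}
```

Since the leading class matches any run of letters and spaces, a heading directly before the first item (with no other characters in between) would be swallowed into that item's match, and names containing digits or other punctuation would be cut short.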

answered by coderplus
