1

I'm trying to write a crawler that extracts the menu items from a site using regex in Java. The website URL is http://www.dinebombaygarden.com/appetizers.html

How can I get the menu items (Vegetable Pakora, Onion or Spinach or Potato Pakora ...) using Pattern and Matcher?

My code is as follows, but it is not working well.

public ArrayList<String> getMenuItems(String menuURL, String menuRegex) throws IOException{
    ArrayList<String> items = new ArrayList<String>();
    Document doc = Jsoup.connect(menuURL).post();
    String text = doc.body().text();
    System.out.println(text);
    Pattern pattern = Pattern.compile(menuRegex);
    Matcher matcher = pattern.matcher(text);
    while(matcher.find()){
        items.add(matcher.group());
    }
    return items;
}

String menuURL = "http://www.dinebombaygarden.com/appetizers.html";
String menuRegex = "[A-Z][a-z]+.{10,50}[$]\\s[\\d.]+.95";

The menuRegex here does not match the items correctly. Can anyone help with this issue?

Thank you very much.

horatio.mars

4 Answers

1

You have a few issues with your regular expression:

  1. [A-Z][a-z]+ applies the + only to the [a-z], and will not handle spaces properly (i.e., it will only match Pakora in Vegetable Pakora).
  2. You need to escape . in .{10,50}, otherwise it's matching any character rather than the period specifically: \.{10,50}.

Here's a regular expression that will match correctly, and capture the name of the food as well as the price in the capture groups:

\<h3\>([^.]+)\.{10,50}[$]\s([\d.]+.95)

It works by finding the <h3> tags, and then capturing all text before the first period as the name of the food. The rest is the same as your original regex, except I've added capturing around the price.

Demo: http://www.rubular.com/r/I7Hyk4cAI0
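As a rough sketch of how this pattern could be used from Java, the snippet below runs it with Pattern and Matcher and reads both capture groups. The class name `MenuRegexDemo`, the `extract` helper, and the sample markup are hypothetical; the markup is only assumed to resemble the page's menu lines. Note that `doc.body().text()` in the question drops the tags, so this assumes the pattern is applied to the HTML instead (for example the string returned by `doc.body().html()`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MenuRegexDemo {
    // The pattern from this answer as a Java string literal: backslashes are
    // doubled, and '<' and '>' are left unescaped because Java does not need it.
    static final Pattern ITEM =
            Pattern.compile("<h3>([^.]+)\\.{10,50}[$]\\s([\\d.]+.95)");

    // Hypothetical helper: returns one "name | price" string per matched line.
    static List<String> extract(String html) {
        List<String> items = new ArrayList<String>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            // group(1) is the food name, group(2) is the price
            items.add(m.group(1).trim() + " | " + m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        // Assumed sample markup, not copied from the real page.
        String html = "<h3>Vegetable Pakora..............$ 4.95</h3>\n"
                    + "<h3>Onion or Spinach or Potato Pakora..............$ 4.95</h3>";
        for (String item : extract(html)) {
            System.out.println(item);
        }
    }
}
```

Using `m.group(1)` and `m.group(2)` rather than `m.group()` is what separates the name from the price; the question's loop would otherwise collect the whole matched line.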

mellamokb
0

You can use the Java API of Selenium to interact with web pages.

For example:

WebDriver driver = new FirefoxDriver();
driver.get("http://www.dinebombaygarden.com/appetizers.html");
List<WebElement> menuElements = driver.findElements(By.cssSelector("#content-center .left-data > h3"));
// now iterate through the elements and get the contents with .getText()

Also, I am the developer of Abmash, which could be an alternative. It lets you do the same job in a more visual way, without knowing anything about the page's source code. Example:

Browser browser = new Browser("http://www.dinebombaygarden.com/appetizers.html");
HtmlElements menuElements = browser.query(headline(), below(headline("appetizers"))).find();
// now iterate through the elements and get the contents with .getText()

More info on Selenium: http://seleniumhq.org/

More info on Abmash: https://github.com/alp82/abmash

Alp
0

Try jsoup (http://jsoup.org) and select the menu headings directly instead of using a regex:

Document doc = Jsoup.connect("http://www.dinebombaygarden.com/appetizers.html").get();
// each h3 inside div.left-data holds one menu line; read it with .text()
Elements menuItems = doc.select("div.left-data h3");
Anton
0

Not the best regex, but this will do the job:

String menuRegex = "['A-Za-z\\s]+\\.{10,50}[$][\\s]*[0-9]*\\.?[0-9]+";
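A minimal sketch of this pattern applied to tag-free text, such as what `doc.body().text()` returns in the question. The class name `MenuTextRegexDemo`, the `extract` helper, and the sample text are hypothetical; the text is only assumed to look like the page's flattened menu lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MenuTextRegexDemo {
    // The regex from this answer, unchanged.
    static final Pattern LINE = Pattern.compile(
            "['A-Za-z\\s]+\\.{10,50}[$][\\s]*[0-9]*\\.?[0-9]+");

    // Hypothetical helper: returns each matched menu line, trimmed.
    static List<String> extract(String text) {
        List<String> lines = new ArrayList<String>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            // The leading character class allows whitespace, so a match can
            // start with the space that separates it from the previous item.
            lines.add(m.group().trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample text, not copied from the real page.
        String text = "Vegetable Pakora..............$ 4.95 "
                    + "Onion or Spinach or Potato Pakora..............$ 4.95";
        for (String line : extract(text)) {
            System.out.println(line);
        }
    }
}
```

Because the leading character class also accepts spaces, any heading words that directly precede an item in the text would be pulled into that item's match, so some cleanup of the first result may be needed.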
coderplus