Regex - find various strings from an HTML file

Question

I have an HTML file called basic.html, and my task is to create a small Java program that uses regular expressions to output various strings. My program should display the line number of every occurrence of each of the strings below:

div tag
div class="menuItem" tag
span tag
class="emph"
Any string beginning with < and ending with >, i.e. all tags.
The contents of the body tag.
The contents of all divs
All divs that make menus

I must also use start and end methods to display index values.

I have started my code as follows:

import java.io.IOException;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexHTML {
   public static void main(String[] args) throws IOException {

      // Input for matching the regex pattern
      String file_name = "basic.html";

      // ReadFile (not shown) reads the file into an array of lines
      ReadFile file = new ReadFile(file_name);
      String[] aryLines = file.OpenFile();
      String asString = Arrays.toString(aryLines);

      // Regex to be matched
      String regexe = "<div>";

      // Print every line of the file
      for (int i = 0; i < aryLines.length; i++) {
         System.out.println(aryLines[i]);
      }

      // Step 1: Allocate a Pattern object to compile a regex
      Pattern pattern = Pattern.compile(regexe);
      //Pattern pattern = Pattern.compile(regexe, Pattern.CASE_INSENSITIVE);  // case-insensitive matching

      // Step 2: Allocate a Matcher object from the compiled regex pattern,
      //         and provide the input to the Matcher
      Matcher matcher = pattern.matcher(asString);

      // Step 3: Perform the matching and process the matching result
      int count = 0;
      // Use method find()
      while (matcher.find()) {     // find the next match
         System.out.println("find() found the pattern \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
         count++;
      }
      System.out.println("\nFound the pattern " + count + " times.\n");

      // Use method matches()
      if (matcher.matches()) {
         System.out.println("matches() found the pattern \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      } else {
         System.out.println("matches() found nothing");
      }

      // Use method lookingAt()
      if (matcher.lookingAt()) {
         System.out.println("lookingAt() found the pattern \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      } else {
         System.out.println("lookingAt() found nothing");
      }

   }

}

My biggest problem is how to display all of those occurrences. My code so far only gives me the index values for the div tag, but I would like every string listed above to appear in the output. My second problem is how to display the line number on which each string occurs. I haven't researched this yet, as I'm focusing on the first question for now. However, if you could give me a hint as to where to get started on this one too, I would appreciate it.

HTML is not a regular language to be parsed with regular expressions. http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — dcernahoschi, Feb 28 '12 at 15:41
Certain things you want to do are do-able with regexes. For others, regexes are not the right tool. `The contents of all divs` is difficult/impossible when there are nested `div` tags. Use a parser for this. — beerbajay, Feb 28 '12 at 15:47
The arbitrary constraints (i.e. use RegEx) prompt me to ask: Is this [tag:homework]? — Andrew Thompson, Feb 28 '12 at 16:16

score 2 · Answer 1 · edited Feb 28 '12 at 15:58


One way is to apply each regex to the individual lines in the String[] aryLines. The line number is then the array index (plus one, if you count lines from 1).
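
A minimal sketch of that line-by-line idea, under some assumptions: the class name LineByLineSearch, the search helper, the pattern labels and the sample lines are all made up for illustration, and the patterns are simplistic guesses rather than a robust way to recognise tags. It keeps the patterns in a map, runs each one over every line, and reports the 1-based line number together with start() and end() within that line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class LineByLineSearch {

    // Runs every pattern over every line and returns one report string per match:
    // the pattern's label, the 1-based line number, and start/end within that line.
    static List<String> search(String[] lines, Map<String, Pattern> patterns) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            for (int i = 0; i < lines.length; i++) {
                Matcher m = entry.getValue().matcher(lines[i]);
                while (m.find()) {
                    hits.add(entry.getKey() + ": line " + (i + 1)
                            + ", start " + m.start() + ", end " + m.end());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for the aryLines array read from basic.html (sample data, assumed).
        String[] aryLines = {
            "<body>",
            "<div class=\"menuItem\">Home</div>",
            "<span class=\"emph\">Hi</span>",
            "</body>"
        };

        // LinkedHashMap keeps the patterns in insertion order for the report.
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("div tag", Pattern.compile("<div\\b[^>]*>"));
        patterns.put("menuItem div", Pattern.compile("<div\\s+class=\"menuItem\"[^>]*>"));
        patterns.put("span tag", Pattern.compile("<span\\b[^>]*>"));
        patterns.put("class emph", Pattern.compile("class=\"emph\""));
        patterns.put("any tag", Pattern.compile("<[^>]+>"));

        for (String hit : search(aryLines, patterns)) {
            System.out.println(hit);
        }
    }
}
```

Adding another search string is then just one more patterns.put(...) call. Note that the indexes reported here are relative to each line, not to the whole file.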

What are you going to do if the phrase you're looking for spans multiple lines? That's valid in HTML... Also, let me be the first to say a regex will not solve this problem in the general case.
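
For a phrase that does span lines, one possible workaround is to match against the whole file as a single string and turn the match offset back into a line number. This is a rough sketch under assumptions: WholeFileSearch and lineOf are hypothetical names, the sample lines are invented, and the lines are joined with String.join("\n", ...) rather than Arrays.toString, because the latter adds brackets and ", " separators that shift the indexes. The reluctant .*? stops at the first closing tag, so the same trick gives wrong results for nested divs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class WholeFileSearch {

    // 1-based line number of a character offset: count the '\n' characters before it.
    static int lineOf(String text, int offset) {
        int line = 1;
        for (int i = 0; i < offset; i++) {
            if (text.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    public static void main(String[] args) {
        // Stand-in for the aryLines array read from basic.html (sample data, assumed).
        String[] aryLines = {
            "<html>", "<body>", "<div>", "text", "</div>", "</body>", "</html>"
        };
        String text = String.join("\n", aryLines);

        // DOTALL lets '.' match line breaks; group 1 is the contents of the body tag.
        Pattern body = Pattern.compile("<body\\b[^>]*>(.*?)</body>",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

        Matcher m = body.matcher(text);
        while (m.find()) {
            System.out.println("body contents: lines " + lineOf(text, m.start(1))
                    + " to " + lineOf(text, m.end(1))
                    + ", start " + m.start(1) + ", end " + m.end(1));
        }
    }
}
```

Here start(1) and end(1) are offsets into the joined string, so they count the line breaks as characters too.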


answered Feb 28 '12 at 15:45

Tony Ennis


score 1 · Answer 2 · answered Feb 28 '12 at 16:01


You really shouldn't use a regular expression to parse HTML. Try an existing library such as JSoup instead. I'm sure that you'd rather not spend your time reinventing HTML parsing!

