I would like to extract some text from an HTML file using regex. I am learning regex and still have trouble understanding it. I have code that extracts all the text between <body> and </body>; here it is:

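import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {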

public static void main(String[] args) throws IOException{

String toMatch=readFile();
//Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
Pattern pattern=Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
Matcher matcher=pattern.matcher(toMatch);

if(matcher.matches()) {
    System.out.println(matcher.group(1));
}

}

 private static String readFile() {

      try{
            // Open the file that is the first 
            // command line parameter
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            //Read File Line By Line
            while (br.readLine() != null)   {
                // Print the content on the console
                //System.out.println (strLine);
                strLine+=br.readLine();
            }
            //Close the input stream
            in.close();
            return strLine;
            }catch (Exception e){//Catch exception if any

                System.err.println("Error: " + e.getMessage());
                return "";
            }
}
}

This works fine, but now I would like to extract the text between the tags <table class="claroTable"> and </table>.

So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?" (I have also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?"), but it doesn't work and I don't understand why. There is only one table in the HTML file, but "table" also occurs in some JavaScript code ("...dataTables.js..."). Could that be the reason for the failure?

Thank you in advance for your help.

EDIT: the HTML to extract is something like:

<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>

What I would like to extract is anything between <table class="claroTable"> and </table>

Alohci
vallllll

2 Answers

Here's how you can do it with the JSoup parser:

File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();

Yes, you can also somehow do it with regex, but it will never be this easy.

Update: The main problem with your regex pattern is that you are missing the DOTALL flag:

Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
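To illustrate what the flag changes, here is a minimal sketch. The class name DotallDemo, the helper firstGroup, and the small multi-line document are made up for the demonstration; only the pattern comes from the question. By default `.` does not match line terminators, so `matches()` cannot succeed on input that contains newlines:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {

    // Returns group 1 if the whole input matches, otherwise null.
    static String firstGroup(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical multi-line document, standing in for the real file.
        String html = "<html>\n<body>\nhello\n</body>\n</html>";
        String regex = ".*?<body.*?>(.*?)</body>.*?";

        // Without DOTALL, '.' stops at each line terminator, so matches() fails.
        System.out.println(firstGroup(Pattern.compile(regex), html));
        // With DOTALL, '.' also matches '\n' and the group can span several lines.
        System.out.println(firstGroup(Pattern.compile(regex, Pattern.DOTALL), html));
    }
}
```

This only matters when the input string actually contains line breaks, which depends on how the file was read.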

And if you just want the specified table tag with contents, you can do something like this:

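// The trailing .* is greedy so the match also consumes everything after </table>;
// with a lazy .*? there, the text following the table would be left in the result.
String tableTag = 
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*",Pattern.DOTALL)
           .matcher(html)
           .replaceFirst("$1");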

(Updated: now returns the contents of the table tag only, not the table tag itself)
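An alternative to replaceFirst is to search with find() and read the capturing group directly, which also makes the "no match" case explicit. The following is a sketch under assumptions, not a general HTML solution: the class name TableExtractor, the method extractTable, and the sample document are hypothetical, and the pattern assumes the class attribute is written exactly as class="claroTable" and that the table contains no nested table (the match ends at the first </table>).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableExtractor {

    // DOTALL lets '.' cross line breaks; "[^>]*" keeps the attribute match inside the opening tag.
    private static final Pattern CLARO_TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>", Pattern.DOTALL);

    // Returns the text between the tags, or null when no such table is found.
    static String extractTable(String html) {
        Matcher matcher = CLARO_TABLE.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical document, including a script reference that contains "Tables".
        String html = "<body>\n<script src=\"dataTables.js\"></script>\n"
                + "<table class=\"claroTable\">\n<tr><td>some data</td></tr>\n</table>\n</body>";
        System.out.println(extractTable(html));
    }
}
```

Because the pattern starts at the literal `<table`, a script name such as dataTables.js earlier in the page does not interfere with the match.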

Sean Patrick Floyd
  • Thanks Sean Patrick Floyd, it actually works with the body tag, but I would like to extract the table tag and that one doesn't work: <table class="claroTable"> .... data to extract... </table> so something like Pattern pattern=Pattern.compile(".*?<table class=\"claroTable\">(.*?)</table>.*?") – vallllll Aug 29 '11 at 10:21
  • Thanks Sean Patrick Floyd, but it returns the whole HTML string again, as if nothing happened. I don't understand what replaceFirst(...) does. – vallllll Aug 29 '11 at 10:53
  • @vallllll replaceFirst("$1") replaces the first match with the contents of its first capturing group. I just saw that you want what's between the table tags; I will update my answer accordingly. – Sean Patrick Floyd Aug 29 '11 at 11:01
  • @vallllll Then you are doing something wrong. here is a working version of my code: http://ideone.com/9xf9U – Sean Patrick Floyd Aug 29 '11 at 11:21
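One thing that may be going wrong is the reading loop in the question's readFile(): it calls readLine() once in the while condition and again in the body, so every line read by the condition is discarded, possibly including the line with the opening table tag. The accumulator also starts as null, so the result begins with the text "null". Below is a sketch of a loop that keeps every line; the class name ReadAllLines and the method names are hypothetical, and the in-memory sample stands in for user.html so the example runs without the file.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;

public class ReadAllLines {

    // Reads every line exactly once and keeps the line breaks.
    static String readAll(BufferedReader reader) throws IOException {
        StringBuilder content = new StringBuilder();
        String line;
        // Assign inside the condition so each readLine() result is used, not discarded.
        while ((line = reader.readLine()) != null) {
            content.append(line).append('\n');
        }
        return content.toString();
    }

    // Convenience wrapper for a file such as the question's "user.html".
    static String readFile(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return readAll(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the file, so the sketch runs without user.html.
        String sample = "<body>\n<table class=\"claroTable\">\n<td>data</td>\n</table>\n</body>";
        System.out.println(readAll(new BufferedReader(new StringReader(sample))));
    }
}
```

Since this version keeps the line breaks, a pattern applied to its result needs Pattern.DOTALL for `.` to match across lines.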
As stated, this is a bad place to use regex. Only use regex when you actually need to, so basically try to stay away from it if you can. Take a look at this post though for parsers:

How to parse and modify HTML file in Java

Community
Matt
  • To Andreas_D and Matt: I know that, but I have to use it. The point here is to use regex; I don't have a choice. The programming language doesn't matter, but using regex is a requirement, so I would really appreciate some help. Thanks. – vallllll Aug 29 '11 at 09:26
  • @vallllll OK, I have updated my answer to actually address your regex issues. – Sean Patrick Floyd Aug 29 '11 at 09:33