Help extracting text from html tag with Java and Regex

Question

I would like to extract some text from an HTML file using regex. I am learning regex and still have trouble understanding it all. I have code that extracts all the text between <body> and </body>. Here it is:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {

    public static void main(String[] args) throws IOException {
        String toMatch = readFile();
        //Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
        Matcher matcher = pattern.matcher(toMatch);

        // matches() succeeds only if the whole input matches the pattern
        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream("user.html")))) {
            StringBuilder content = new StringBuilder();
            String line;
            // Read each line exactly once; readLine() drops the line terminator,
            // so the lines are joined into one long string
            while ((line = br.readLine()) != null) {
                content.append(line);
            }
            return content.toString();
        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}

It works fine like this, but now I would like to extract the text between the tags <table class="claroTable"> and </table>.

So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?". I have also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?", but it doesn't work and I don't understand why. There is only one table in the HTML file, but the word "table" also occurs in a JavaScript reference: "...dataTables.js...". Could that be the reason for the failure?

Thank you in advance for helping me,

EDIT: the HTML text to extract is something like:

<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>

What I would like to extract is anything between <table class="claroTable"> and </table>

If you want to extract data from html: use an html parser. If you want to learn RegExp: do **not** use html or xml input. Sooner or later you'll realize, that regexp'ing html doesn't work. — Andreas Dolk, Aug 29 '11 at 09:10
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — NimChimpsky, Aug 29 '11 at 09:27
Hope this link helps; it gives sample code for an extractor from HTML:
http://bejavadeveloper.blogspot.in/ — Jignesh Vachhani, Oct 16 '12 at 09:15

Sean Patrick Floyd · Accepted Answer · 2011-08-29T11:02:15.777

6

Here's how you can do it with the JSoup parser:

File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();

Yes, you can also somehow do it with regex, but it will never be this easy.

Update: The main problem with your regex pattern is that you are missing the DOTALL flag:

Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
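To see what the flag changes, here is a minimal sketch. The class name DotAllDemo, the helper fullMatch, and the sample input are made up for illustration. By default `.` does not match line terminators, so a lazy `(.*?)` cannot span a multi-line body; with Pattern.DOTALL it can.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns true if the WHOLE input matches the regex compiled with the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical input: the body content spans several lines.
        String html = "<body>\nline one\nline two\n</body>";
        String regex = ".*?<body.*?>(.*?)</body>.*?";

        // Without DOTALL, '.' stops at '\n', so the match fails.
        System.out.println(fullMatch(regex, 0, html));              // false
        // With DOTALL, '.' also matches '\n', so the match succeeds.
        System.out.println(fullMatch(regex, Pattern.DOTALL, html)); // true
    }
}
```

If the input has already had its line breaks removed, the flag makes no difference, which may be why a pattern appears to work on one input and not on another.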

And if you just want the specified table tag with contents, you can do something like this:

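// The trailing .* is greedy so that the match consumes the rest of the
// document; replaceFirst("$1") then leaves only the captured group.
String tableTag = 
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*",Pattern.DOTALL)
           .matcher(html)
           .replaceFirst("$1");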

(Updated: now returns the contents of the table tag only, not the table tag itself)
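An alternative that may be easier to reason about is Matcher.find() with group(1), which needs no leading or trailing `.*?` and makes the "no match" case explicit. This is only a sketch under assumptions: the class name TableExtractor, the choice to return null when nothing matches, and the sample HTML are hypothetical, and it does not handle nested tables.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableExtractor {
    // [^>]* keeps the attribute search inside the opening <table ...> tag;
    // DOTALL lets (.*?) span the line breaks inside the table.
    private static final Pattern CLARO_TABLE = Pattern.compile(
            "<table\\b[^>]*\\bclass=\"claroTable\"[^>]*>(.*?)</table>",
            Pattern.DOTALL);

    // Returns the text between the opening tag and the first </table>,
    // or null if no such table is found.
    static String extract(String html) {
        Matcher m = CLARO_TABLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical page: a script named dataTables.js plus one table.
        String html = "<html><script src=\"dataTables.js\"></script><body>\n"
                + "<table class=\"claroTable\">\n<tr><td>some data</td></tr>\n</table>\n"
                + "</body></html>";
        System.out.println(extract(html));
    }
}
```

The string "dataTables.js" does not interfere here, because the pattern only starts matching at a literal `<table`.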

edited Aug 29 '11 at 11:02

answered Aug 29 '11 at 09:24

Sean Patrick Floyd


thks Sean Patrick Floyd, it actually works with body tag but I would like to extract the table tag and that one doesn't work: <table class="claroTable"> .... data to extract... </table> so something like Pattern pattern=Pattern.compile(".*?<table class=\"claroTable\">(.*?)</table>.*?") – vallllll Aug 29 '11 at 10:21
thks Sean Patrick Floyd but it returns me the whole html string again as if nothing happened. I don't understand what the replaceFirst(...) do?? – vallllll Aug 29 '11 at 10:53
@vallllll replaceFirst("$1") means replace the String with the first matched group of the first match. I just saw that you want what's between the table tags, will update my answer accordingly. – Sean Patrick Floyd Aug 29 '11 at 11:01
@vallllll Then you are doing something wrong. here is a working version of my code: http://ideone.com/9xf9U – Sean Patrick Floyd Aug 29 '11 at 11:21
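The behaviour discussed in these comments can be reproduced with a small sketch. replaceFirst("$1") replaces the first match with its first capture group, and if the pattern matches nothing, the input comes back unchanged, which looks "as if nothing happened". The class name ReplaceFirstDemo, the helper tableContents, and the sample HTML are hypothetical; the pattern used here ends in a greedy `.*` so that the match covers the rest of the input.

```java
import java.util.regex.Pattern;

class ReplaceFirstDemo {
    // Replaces the first match of the pattern with capture group 1.
    // If there is no match, replaceFirst returns the input unchanged.
    static String tableContents(String html, int flags) {
        return Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*", flags)
                .matcher(html)
                .replaceFirst("$1");
    }

    public static void main(String[] args) {
        // Hypothetical multi-line document.
        String html = "<p>intro</p>\n<table class=\"claroTable\">\n"
                + "<tr><td>x</td></tr>\n</table>\n<p>end</p>";

        // No DOTALL: '.' cannot cross line breaks, nothing matches,
        // so the whole document is returned as-is.
        System.out.println(tableContents(html, 0).equals(html)); // true

        // DOTALL: the match spans the whole document and is replaced
        // by the table contents only.
        System.out.println(tableContents(html, Pattern.DOTALL));
    }
}
```

Getting the whole HTML string back is therefore a sign that the pattern did not match at all, not that the replacement misbehaved.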

score 0 · Answer 2 · edited May 23 '17 at 12:00


As stated, this is a bad place to use regex. Only use regex when you actually need to, so basically try to stay away from it if you can. Take a look at this post though for parsers:

How to parse and modify HTML file in Java

edited May 23 '17 at 12:00

Community


answered Aug 29 '11 at 09:20

Matt


to: Andreas_D and Matt: I know that, but I have to use it. The point here is to use regex; I don't have a choice. The programming language doesn't matter, but using regex is a requirement, so I would really appreciate some help. thks – vallllll Aug 29 '11 at 09:26
@vallllll OK, I have updated my answer to actually address your regex issues. – Sean Patrick Floyd Aug 29 '11 at 09:33
