Let's say I copy a complete HTML table (where each and every tr and td has extra attributes) into a String. How can I take all the contents (the text between the tags) and create a 2D array that is organized like the original table?

For example for this table:

<table border="1">
    <tr align= "center">
        <td align="char">TD1</td>
        <td>td1</td>
        <td align="char">TD1</td>
        <td>td1</td>
    </tr>
    <tr>
        <td>TD2</td>
        <td>tD2</td>
        <td class="bold>Td2</td>
        <td>td2</td>
    </tr>
</table>

I want this array: {{"TD1", "td1", "TD1", "td1"}, {"TD2", "tD2", "Td2", "td2"}}

PS: I know I can use regex, but it would be extremely complicated. I want a tool like JSoup that can do all the work automatically without writing much code.

RE6
  • If the HTML is valid, you can use a SAX XML parser or HTMLCleaner: http://htmlcleaner.sourceforge.net/ There are also a lot of other libs that help to parse HTML. Just check this list: http://java-source.net/open-source/html-parsers – Sergii Stotskyi Aug 15 '12 at 10:41
  • Are you actually asking for an algorithm that will parse your table string into a data array? – Less Aug 15 '12 at 10:41
  • I have just added that I want a simple tool like JSoup that does the work automatically, without much code writing and analyzing – RE6 Aug 15 '12 at 10:42

5 Answers

12

This is how it could be done using JSoup (srsly, don't use regexp for HTML).

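import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// html is the String that holds the copied table markup
Document doc = Jsoup.parse(html);
Elements tables = doc.select("table");
for (Element table : tables) {
    Elements trs = table.select("tr");
    // one String[] per row; rows may have different lengths
    String[][] trtd = new String[trs.size()][];
    for (int i = 0; i < trs.size(); i++) {
        Elements tds = trs.get(i).select("td");
        trtd[i] = new String[tds.size()];
        for (int j = 0; j < tds.size(); j++) {
            // text() returns the cell content without any tags
            trtd[i][j] = tds.get(j).text();
        }
    }
    // trtd now contains the desired array for this table
}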

Also, the class attribute value is not closed properly here in your example:

<td class="bold>Td2</td>

it should be

<td class="bold">Td2</td>
Community
Jens
5

Maybe String.split("<whateverhtmltabletag>") can help you?

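The StringTokenizer class looks similar, but it treats every character of its delimiter string as a separate delimiter, so "<br>" would also split on 'b' and 'r' (the word "three" would come out as "th" and "ee"). String.split matches the whole tag instead. Example:

String data = "one<br>two<br>three";
// split takes a regex; "<br>" contains no special characters, so it matches literally
for (String token : data.split("<br>")) {
   System.out.println(token);  // prints one, then two, then three
}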

Another option is to scan the string with indexOf("<tag"); there is an example here: http://forums.devshed.com/java-help-9/parse-html-table-into-2d-arrays-680614.html
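The indexOf idea can be sketched roughly as follows. This is only a minimal illustration using the standard library, not the code from the linked thread: the class name IndexOfTableScan and the method parse are made up for this example, and it assumes lowercase tags, explicit closing </td> tags, and no nested tables.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfTableScan {

    /** Collects the text of every <td> cell, grouped by <tr> row (hypothetical helper). */
    static List<List<String>> parse(String html) {
        List<List<String>> rows = new ArrayList<>();
        int pos = 0;
        while (true) {
            int rowStart = html.indexOf("<tr", pos);
            if (rowStart < 0) break;                       // no more rows
            int rowEnd = html.indexOf("</tr>", rowStart);
            if (rowEnd < 0) rowEnd = html.length();        // tolerate a missing </tr>

            List<String> cells = new ArrayList<>();
            int cellPos = rowStart;
            while (true) {
                int tdStart = html.indexOf("<td", cellPos);
                if (tdStart < 0 || tdStart >= rowEnd) break;      // next cell belongs to a later row
                int textStart = html.indexOf('>', tdStart) + 1;   // skip the opening tag and its attributes
                int textEnd = html.indexOf("</td>", textStart);
                if (textStart == 0 || textEnd < 0 || textEnd > rowEnd) break;  // malformed cell
                cells.add(html.substring(textStart, textEnd).trim());
                cellPos = textEnd + "</td>".length();
            }
            rows.add(cells);
            pos = rowEnd;
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table border=\"1\">"
                + "<tr align=\"center\"><td align=\"char\">TD1</td><td>td1</td></tr>"
                + "<tr><td>TD2</td><td class=\"bold\">Td2</td></tr>"
                + "</table>";
        System.out.println(parse(html));
    }
}
```

A sketch like this breaks quickly on real-world markup (uppercase tags, omitted closing tags, cells that contain other elements), which is why a real HTML parser is usually the safer choice.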

You can also use an HTML parser (like jsoup) and then copy the contents from the table to an array. Here's an example in JavaScript: JavaScript to parse HTML table of numbers into an array

Community
NotGaeL
0

Never mind, I found this code on the internet: HtmlTableParser

It actually seems that now I have another problem, but it is not exactly related to this question, so I will open another one.

trashgod
RE6
0

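This is what I have so far. It is not the best solution, but I hope it's helpful. It uses only simple String operations and assumes that every <td>...</td> cell and every </tr> sits on its own line.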

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public void read_data() {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader("_result.xml"))) {
        StringBuilder output = new StringBuilder();
        String line;

        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();

            if (trimmed.startsWith("<td")) {
                // the cell text lies between the first '>' and the last '<'
                int a = trimmed.indexOf('>') + 1;
                int b = trimmed.lastIndexOf('<');
                if (b >= a) {  // skip lines that have no closing tag
                    output.append(trimmed, a, b).append('|');
                }
            }

            if (trimmed.equals("</tr>")) {
                // end of row: print the collected cells and start a new row
                System.out.println(output);
                output.setLength(0);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
karoles
0

For my own needs, I found that in JavaScript the DOM already lets you address a table much like a 2D array. Consider the following code:

document.querySelector("#table").children[0].children[r].children[c].innerText

In the above, children[0] is the table's <tbody>, r is the row index, and c is the column index (both starting at 0). The data can be accessed just like a 2D array using the row and column indices, with no conversion step.

Here is yet another way, similar to the 2D-array access, but with CSS selectors:

document.querySelector("tr:nth-child(5) td:nth-child(4)")

This finds the cell in the 4th column of the 5th row (nth-child counts from 1).

JohnP2