I'd like to parse an HTML page and extract the values of a table, for example as a list of dictionaries, where each list element is a dictionary corresponding to one row of the table.

Let's say that the table is:

Table:

<table style="width:100%">
  <tr>
    <td>Jill</td>
    <td>Smith</td>      
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>        
    <td>94</td>
  </tr>
</table>

Expected result:

[Jill,  Smith,  50]
[Eve,   Jackson,    94]

I currently do this in two ways:

  1. Using a GPath expression (XPath-like):

    page.body.div.table.tr.time;
    
  2. Using a closure like this:

    page."**".findAll { it.@class.toString().contains("time")}.each {
    

Both approaches use XmlSlurper with the TagSoup parser:

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
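
For reference, here is a minimal sketch of how the closure approach could be applied to the sample table above to get one list per row. It assumes the HTML is available as a string; the variable names and the column keys (`first`, `last`, `value`) are made up for illustration, since the sample table has no header row.

```groovy
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
import groovy.xml.XmlSlurper   // Groovy 3+; on Groovy 2.x drop this import (groovy.util is auto-imported)

def html = '''
<table style="width:100%">
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
'''

def parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def page = parser.parseText(html)

// Depth-first search for every <tr>, so the path to the table is not hardcoded
def rows = page.'**'.findAll { it.name() == 'tr' }.collect { tr ->
    tr.td.collect { it.text().trim() }
}

// Hypothetical column names: the sample table has no <th> cells to take them from
def keys = ['first', 'last', 'value']
def dicts = rows.collect { row -> [keys, row].transpose().collectEntries() }

println rows
println dicts
```

Searching for `tr` by name rather than by a fixed path keeps the code working if the table is moved inside other elements, at the cost of walking the whole document.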

Is there another way of getting table values using Groovy?

Thanks for the help!

dmahapatro
DataScientYst
  • Are there any issues with either of the above approaches that make a third one necessary? – dmahapatro May 08 '16 at 15:51
  • Should something in your example HTML have a class of "time"? – tim_yates May 08 '16 at 18:19
  • The main concern with the first approach is that it is hardcoded, so it isn't flexible: if the page structure changes, it could return unexpected results. The second approach is the one I currently prefer; its only problems are the computational cost and the need for regular expressions in some cases. I was looking for a general solution similar to: http://stackoverflow.com/questions/6325216/parse-html-table-to-python-list – DataScientYst May 09 '16 at 04:22

1 Answer

I have had good results using the jsoup HTML parser. It's a Java library, but it works well with Groovy. Here's an example of parsing a table in Java, and a helpful blog entry on scraping using Groovy and jsoup. This question has an answer with a Groovy example of parsing a table.
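
As a rough illustration of the jsoup route (a sketch, not the code from the linked posts), the following Groovy script parses the table from the question into one list per row. The jsoup version in `@Grab` and the variable names are assumptions; any recent jsoup release should behave the same for this case.

```groovy
@Grab(group='org.jsoup', module='jsoup', version='1.15.3')  // version is an assumption
import org.jsoup.Jsoup

def html = '''
<table style="width:100%">
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
'''

def doc = Jsoup.parse(html)

// 'table tr' is a descendant selector, so it still matches
// if the parser inserts a <tbody> between <table> and <tr>
def rows = doc.select('table tr').collect { tr ->
    tr.select('td').collect { it.text() }
}

rows.each { println it }
```

CSS selectors make it easy to narrow the match later, for example to a table with a particular class, without changing the rest of the code.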

Community
Nicholas
  • And this is the working example that I've found: http://stackoverflow.com/questions/5396098/how-to-parse-a-table-from-html-using-jsoup. There is a groovy version as well. Thank you. – DataScientYst May 09 '16 at 13:51