I am scraping a table that will ultimately be exported to CSV. There are several cases I may need to consider, such as nested tables and spanned rows/cells, but for now I'm going to ignore those and assume I have a very simple table. By "simple" I mean we just have rows and cells, possibly with an unequal number of cells per row, but the structure is still fairly basic.

<table>
  <tr>
    <td>text </td>
    <td>text </td>
  </tr>
  <tr>
    <td>text </td>
  </tr>
</table>
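
Since rows may have different cell counts, one option is to pad short rows before export so the CSV stays rectangular. This is a minimal sketch using only the JDK, assuming the rows have already been collected as String[] arrays; the names RaggedRows and padToWidest are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RaggedRows {
    /** Pads every row with empty strings so that all rows are as wide as the widest row. */
    static List<String[]> padToWidest(List<String[]> rows) {
        // First pass: find the largest cell count.
        int width = 0;
        for (String[] row : rows) {
            width = Math.max(width, row.length);
        }
        // Second pass: copy each row into an array of that width.
        List<String[]> padded = new ArrayList<>();
        for (String[] row : rows) {
            String[] copy = Arrays.copyOf(row, width); // extra slots start as null
            for (int i = row.length; i < width; i++) {
                copy[i] = ""; // replace null with an empty cell
            }
            padded.add(copy);
        }
        return padded;
    }

    public static void main(String[] args) {
        // Mirrors the sample table: one row with two cells, one row with one cell.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"text", "text"});
        rows.add(new String[] {"text"});
        for (String[] row : padToWidest(rows)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Whether padding is wanted depends on the consumer of the CSV; some readers accept ragged rows as they are.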

My approach is to simply iterate over the rows and columns:

String[] rowTxt;
WebElement table = driver.findElement(By.xpath(someLocator));
for (WebElement rowElmt : table.findElements(By.tagName("tr")))
{
    List<WebElement> cols = rowElmt.findElements(By.tagName("td"));
    rowTxt = new String[cols.size()];
    for (int i = 0; i < rowTxt.length; i++)
    {
        rowTxt[i] = cols.get(i).getText();
    }
}

However, this is quite slow. For a CSV file with 218 lines (that is, a table with 218 rows), each with no more than 5 columns, it took 45 seconds to scrape the table.

I tried to avoid iterating over each cell by calling getText on the row element, hoping the output would be delimited by something, but it wasn't.

Is there a better way to scrape a table?

MxLDevs
  • Alternatively, I may consider using selenium to get the page source, and then use Jsoup to do the actual HTML parsing, since I liked Jsoup's performance. – MxLDevs Jan 20 '14 at 21:18

3 Answers

Rather than using Selenium to parse the HTML, I use Jsoup. While Selenium provides functionality for traversing a table, Jsoup is much more efficient. I've decided to use Selenium only for webpage automation and delegate all parsing tasks to Jsoup.

My approach is as follows:

  1. Get the HTML source for the required element
  2. Pass that to Jsoup as a string to parse

The code that I ended up writing was very similar to the Selenium version:

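// Wrap the inner HTML in <table> tags so Jsoup parses the rows in table context.
String source = "<table>" + driver.findElement(By.xpath(locator)).getAttribute("innerHTML") + "</table>";
// The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted.
Document doc = Jsoup.parse(source);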
for (Element rowElmt : doc.getElementsByTag("tr"))
{
    Elements cols = rowElmt.getElementsByTag("th");
    if (cols.size() == 0 )
        cols = rowElmt.getElementsByTag("td");

    rowTxt = new String[cols.size()];
    for (int i = 0; i < rowTxt.length; i++)
    {
        rowTxt[i] = cols.get(i).text();
    }
    csv.add(rowTxt);
}

The Selenium parser takes 5 minutes to read a 1000 row table, while the Jsoup parser takes less than 10 seconds. While I did not spend much time on benchmarking, I am pretty satisfied with the results.
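
The collected rows still have to be written out as CSV text. The following is a minimal sketch using only the JDK, assuming csv is a List<String[]> as the loop above suggests; the class name CsvLines and its methods are hypothetical, and the quoting follows the usual RFC 4180 convention of doubling embedded quotes.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvLines {
    /** Quotes a field when it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are escaped by doubling them.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins one row of cell texts into a single CSV line (without a line terminator). */
    static String toLine(String[] row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(row[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; in the answer above these rows would come from the Jsoup loop.
        List<String[]> csv = new ArrayList<>();
        csv.add(new String[] {"name", "note"});
        csv.add(new String[] {"Smith, J.", "said \"hi\""});
        for (String[] row : csv) {
            System.out.println(toLine(row));
        }
    }
}
```

For anything beyond simple tables, a dedicated CSV library is likely the safer choice.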

MxLDevs

It most definitely is slow, no matter whether you locate elements by XPath, id, or CSS. That said, if you use the Page Object pattern, you can make use of the @CacheLookup annotation. From the source:

  • By default, the element or the list is looked up each and every time a method is called upon it.
  • To change this behaviour, simply annotate the field with the {@link CacheLookup}.

I ran a test using a table of 100 rows and 6 columns; the test queried the text of each and every td element. Without @CacheLookup (with the element located by XPath, as in your case), it took approx. 40 seconds. With cache lookup it dropped to approx. 20 seconds, which is still too much.

Anyway, if you drop the Firefox driver and run your tests headless (using HtmlUnit), the speed increases drastically. Running the same test headless, the times were between 100 and 200 ms, so it could even be faster than Jsoup.

You can check/try my test code here.

Erki M.
  • I'll have to see whether HtmlUnitDriver supports the site I'm using it on, since I have had a number of javascript-related issues that I had not figured out how to get around. So I went with a browser to handle the javascript for me. – MxLDevs Jan 26 '14 at 19:52

I'm using HtmlAgilityPack, installed as a NuGet package, to parse dynamic HTML tables. It's very fast and, as per this answer, you can query the results using LINQ. I've used this to store the result as a DataTable. Here's the public extension method class:-

public static class HtmlTableExtensions
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(HtmlTableExtensions));

    /// <summary>
    ///     based on an idea from https://stackoverflow.com/questions/655603/html-agility-pack-parsing-tables
    /// </summary>
    /// <param name="tableBy"></param>
    /// <param name="driver"></param>
    /// <returns></returns>
    public static HtmlTableData GetTableData(this By tableBy, IWebdriverCore driver)
    {
        try
        {
            var doc = tableBy.GetTableHtmlAsDoc(driver);
            var columns = doc.GetHtmlColumnNames();
            return doc.GetHtmlTableCellData(columns);
        }
        catch (Exception e)
        {
            Log.Warn(String.Format("unable to get table data from {0} using driver {1} ",tableBy ,driver),e);
            return null;
        }
    }

    /// <summary>
    ///     Take an HtmlTableData object and convert it into an untyped data table,
    ///     assume that the row key is the sole primary key for the table,
    ///     and the key in each of the rows is the column header
    ///     Hopefully this will make more sense when it's written!
    ///     Expecting overloads for switching column and headers,
    ///     multiple primary keys, non standard format html tables etc
    /// </summary>
    /// <param name="htmlTableData"></param>
    /// <param name="primaryKey"></param>
    /// <param name="tableName"></param>
    /// <returns></returns>
    public static DataTable ConvertHtmlTableDataToDataTable(this HtmlTableData htmlTableData,
        string primaryKey = null, string tableName = null)
    {
        if (htmlTableData == null) return null;
        var table = new DataTable(tableName);

        foreach (var colName in htmlTableData.Values.First().Keys)
        {
            table.Columns.Add(new DataColumn(colName, typeof (string)));
        }
        // primaryKey defaults to null; only set a key when one was supplied.
        if (primaryKey != null) table.SetPrimaryKey(new[] { primaryKey });
        foreach (var values in htmlTableData
            .Select(row => row.Value.Values.ToArray<object>()))
        {
            table.Rows.Add(values);
        }

        return table;
    }


    private static HtmlTableData GetHtmlTableCellData(this HtmlDocument doc, IReadOnlyList<string> columns)
    {
        var data = new HtmlTableData();
        foreach (
            var rowData in doc.DocumentNode.SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableRow)
                .Skip(1)
                .Select(row => row.SelectNodes(HtmlAttributes.TableCell)
                    .Select(n => WebUtility.HtmlDecode(n.InnerText)).ToList()))
        {
            data[rowData.First()] = new Dictionary<string, string>();
            for (var i = 0; i < columns.Count; i++)
            {
                data[rowData.First()].Add(columns[i], rowData[i]);
            }
        }
        return data;
    }

    private static List<string> GetHtmlColumnNames(this HtmlDocument doc)
    {
        var columns =
            doc.DocumentNode.SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableRow)
                .First()
                .SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableHeader)
                .Select(n => WebUtility.HtmlDecode(n.InnerText).Trim())
                .ToList();
        return columns;
    }

    private static HtmlDocument GetTableHtmlAsDoc(this By tableBy, IWebdriverCore driver)
    {
        var webTable = driver.FindElement(tableBy);
        var doc = new HtmlDocument();
        doc.LoadHtml(webTable.GetAttribute(HtmlAttributes.InnerHtml));
        return doc;
    }
}

The html data object is just an extension of dictionary:-

public class HtmlTableData : Dictionary<string,Dictionary<string,string>>
{

}

IWebdriverCore driver is a wrapper on IWebDriver or IRemoteWebdriver which exposes either of these interfaces as a readonly property, but you could just replace this with IWebDriver.

HtmlAttributes is a static class holding const values for common HTML attributes to save on typos when referring to HTML elements/attributes/tags etc. in C# code:-

/// <summary>
/// config class holding common Html Attributes and tag names etc
/// </summary>
public static class HtmlAttributes
{
    public const string InnerHtml = "innerHTML";
    public const string TableRow = "tr";
    public const string TableHeader = "th";
    public const string TableCell = "th|td";
    public const string Class = "class";

... }

and SetPrimaryKey is an extension of DataTable which allows easy setting of the primary key for a DataTable:-

    public static void SetPrimaryKey(this DataTable table,string[] primaryKeyColumns)
    {
        int size = primaryKeyColumns.Length;
        var keyColumns = new DataColumn[size];
        for (int i = 0; i < size; i++)
        {
            keyColumns[i] = table.Columns[primaryKeyColumns[i]];
        }
        table.PrimaryKey = keyColumns;
    }

I found this to be pretty performant - < 2 ms to parse a 30*80 table, and it's a doddle to use.

Dave00Galloway