I have a huge HTML Table (about 500,000 rows) that I need to transform into a JSON file. The table looks something like this:

<table>
<tr>
<th>Id</th>
<th>Timestamp</th>
<th>Artist_Name</th>
<th>Tweet_Id</th>
<th>Created_at</th>
<th>Tweet</th>
<th>User_name</th>
<th>User_Id</th>
<th>Followers</th>
</tr>
<tr>
<td>1</td>
<td>2013-06-07 16:00:17</td>
<td>Kelly Rowland</td>
<td>343034567793442816</td>
<td>Fri Jun 07 15:59:48 +0000 2013</td>
<td>So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...</td>
<td>Nicole Barrett</td>
<td>33831594</td>
<td>62</td>
</tr>
<tr>
<td>2</td>
<td>2013-06-07 16:00:17</td>
<td>Kelly Rowland</td>
<td>343034476395368448</td>
<td>Fri Jun 07 15:59:27 +0000 2013</td>
<td>RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht…</td>
<td>A.J.</td>
<td>24193447</td>
<td>340</td>
</tr>
</table>

I would like to create a JSON file that looks something like this:

{"data": [
  {
   "text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
   "id": 1,
   "tweet_id": 343034567793442816
  },
  {
   "text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht…",
   "id": 2,
   "tweet_id": 343034476395368448
  }
]}

Maybe with some more of the variables included, but that should be self-explanatory.

I have already looked into several options, but I mostly run into the problem that my HTML table is so big. I have seen a lot of people recommending jQuery. Does that make sense given the size of my table? If there is a suitable Python option I would very much favor it, as I have written most of my code so far in Python.

Tom
  • Wow, an HTML table with 500,000 rows, man, that's huge, I'd say too huge... Why not use pagination? BTW, where does this data come from in the first place? I don't think you hardcoded your table, right?! – A. Wolff Jul 04 '13 at 10:11
  • Data is crawled from Twitter. I have it in a database but exporting was so far only successful in HTML. With all other formats the database tool cancelled my export requests. – Tom Jul 04 '13 at 10:32

3 Answers


Here is some sample code:

var tbl = $('table tr:has(td)').map(function(i, v) {
    var $td = $('td', this);
    return {
        Id: $td.eq(0).text(),
        Timestamp: $td.eq(1).text(),
        Artist_Name: $td.eq(2).text(),
        Tweet_Id: $td.eq(3).text(),
        Created_at: $td.eq(4).text(),
        Tweet: $td.eq(5).text(),
        User_name: $td.eq(6).text(),
        User_Id: $td.eq(7).text(),
        Followers: $td.eq(8).text()
    };
}).get();
Hary
  • with 500,000 rows, I wouldn't be surprised if JS fired an error like "script takes too long to respond" – A. Wolff Jul 04 '13 at 10:17
  • map() creates a new array, meaning this will eat up much more resources than each() - something worth noting when you are working with 500,000 rows. – verenion Jul 04 '13 at 10:20
  • then you should be using setTimeout; this will behave badly in IE for a lot of rows – wilsonrufus Jul 04 '13 at 10:20

Are you using PHP? If so, you could do something with HTML DOM parsing and json_encode().

If you want to use jQuery / JavaScript, this is going to be TOO big to process comfortably - it is possible, but it's not great to process that much data in JS. However, if you are only doing it once, and you really are determined to use JS - then there is a similar question on SO...

Check out this fiddle...

http://jsfiddle.net/s4tyk/

var myRows = [];
var headersText = [];
var $headers = $("th");

// Loop through grabbing everything
var $rows = $("tbody tr").each(function(index) {
  var $cells = $(this).find("td");
  myRows[index] = {};

  $cells.each(function(cellIndex) {
    // Set the header text
    if(headersText[cellIndex] === undefined) {
      headersText[cellIndex] = $($headers[cellIndex]).text();
    }
    // Update the row object with the header/cell combo
    myRows[index][headersText[cellIndex]] = $(this).text();
  });    
});

// Let's put this in the object like you want and convert to JSON (Note: jQuery will also do this for you on the Ajax request)
var myObj = {
    "myrows": myRows
};
alert(JSON.stringify(myObj));

This is the question: How to convert the following table to JSON with javascript?

If you need help with the JS, just ask, but this should get you going.

verenion

Using Python and the lxml package, you could parse the HTML with

import lxml.html as LH
import collections
import itertools as IT
import json

Row = collections.namedtuple(
    'Row',
    'id timestamp artist tweet_id created_at tweet user_name user_id followers')

filename = '/tmp/test.html'
root = LH.parse(filename)
data = []
result = {'data': data}
# group the flat list of cell texts into tuples of 9 (one per <tr>)
for row in IT.starmap(Row, zip(*[iter(root.xpath('//tr/td/text()'))]*9)):
    data.append({'text': row.tweet, 'id': row.id, 'tweet_id': row.tweet_id})

with open('/tmp/test.json', 'w') as f:
    json.dump(result, f, indent=2)
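The `zip(*[iter(...)]*9)` construction above is the standard grouper idiom for chunking a flat sequence into fixed-size rows; a small stdlib-only illustration (the sample values are invented):

```python
# zip(*[iter(seq)] * n) hands zip the SAME iterator n times, so each
# output tuple consumes n consecutive items -- i.e. fixed-size chunks.
cells = ['1', 'a', 'b', '2', 'c', 'd']   # flat list: 2 rows of 3 cells
rows = list(zip(*[iter(cells)] * 3))
print(rows)   # [('1', 'a', 'b'), ('2', 'c', 'd')]
```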

It took about 50 seconds on a file with 500K <tr> tags and required around 910 MB of RAM (to load the HTML into a DOM with root = LH.parse(filename)).
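If holding the whole DOM in memory is a problem, a streaming alternative using only the stdlib `html.parser` never builds a tree at all. This is a sketch under the column layout shown in the question; the class name `TableToJSON` and the inline sample rows are mine, not part of the answer:

```python
import json
from html.parser import HTMLParser

class TableToJSON(HTMLParser):
    """Collect one dict per <tr>; memory use stays flat regardless of table size."""
    def __init__(self):
        super().__init__()
        self.rows, self.cells, self.in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')

    def handle_data(self, data):
        if self.in_td:                      # accumulate text inside the current <td>
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False
        elif tag == 'tr' and self.cells:    # the header row has <th> cells, so it is skipped
            self.rows.append({'text': self.cells[5],
                              'id': int(self.cells[0]),
                              'tweet_id': int(self.cells[3])})
            self.cells = []

parser = TableToJSON()
parser.feed('<table><tr><th>Id</th></tr>'
            '<tr><td>1</td><td>ts</td><td>artist</td><td>343034567793442816</td>'
            '<td>created</td><td>hello</td><td>user</td><td>7</td><td>62</td></tr></table>')
print(json.dumps({'data': parser.rows}))
```

For the real file, feed it in chunks (`for chunk in iter(lambda: f.read(1 << 20), ''): parser.feed(chunk)`) instead of reading 500,000 rows into one string.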

Loading the json file into a Python dict took about 2 seconds:

In [14]: time x = json.load(open('/tmp/test.json'))
CPU times: user 1.80 s, sys: 0.04 s, total: 1.84 s
Wall time: 1.85 s
unutbu
  • for me this only gives: { "data": [] } – Tom Jul 04 '13 at 20:06
  • That might happen if your HTML does not look exactly like what you posted. Try using `root.xpath('//tr/td/text()')`. (I'll edit my post to show what I mean.) The two forward-slashes tell `root.xpath` to search recursively for all `tr` tags, not just at the top level. – unutbu Jul 04 '13 at 20:53
  • Ah ok. That is my fault. The HTML code starts always at the beginning of the line. I edited the example so that is more easily readable. Thought that would not make a difference. Will adjust the example as well. – Tom Jul 05 '13 at 09:11
  • Hold on -- it is not the whitespace (indentation) that is a problem. The problem solved by the two forward-slashes has to do with HTML that begins with extra tags, such as `<html><body>...`. The two forward-slashes tell XPath to search down through the `<html>`, `<body>`, and any other tags until it finds a `<tr>`. – unutbu Jul 05 '13 at 10:54
  • I tried it with several adjustments. My machine keeps processing but it just never comes to an end. After over 2 hrs I gave up ;( – Tom Jul 08 '13 at 17:57
  • When you canceled the program, what did the traceback say? It will tell us what line Python was executing at the time of termination. – unutbu Jul 08 '13 at 18:09
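The relative-vs-recursive path distinction discussed in the comments above can be reproduced with the stdlib `ElementTree`, where `.//` plays the role of lxml's `//` (the one-line document is invented for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<html><body><table><tr><td>x</td></tr></table></body></html>')
# Relative path: looks only at direct <tr> children of the root <html> -- none exist
print([td.text for td in doc.findall('tr/td')])     # []
# './/' searches the whole subtree, like XPath's '//'
print([td.text for td in doc.findall('.//tr/td')])  # ['x']
```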