2

I would like to process the Stack Exchange raw data dump into BigQuery. The data comes in 7z format, so I uncompressed it in order to repackage it as gz, but the file inside is XML. I therefore need to convert the file from XML to JSON. Any ideas? I used p7zip to uncompress and xml2json to try to convert the XML file, but it did not work.

<?xml version="1.0" encoding="utf-8"?>
<comments>
  <row Id="1" PostId="1" Score="3" Text="We need to all post more questions. Last time, we kinda &quot;rushed&quot; to get a whole bunch of people to sign up at the last minute (and pulled some funny stuff" CreationDate="2014-02-12T01:01:14.257" UserId="52" />
  ...

I used xml2json:

xml2json -t json2xml -o xxx.xml yyy.json

Another test, using xml-json (per David's recommendations):

I used the file Users.xml (895 MB) from stackoverflow.com-Users.7z with this command:

xml-json Users.xml row > Users.json

/usr/local/lib/node_modules/xml-json/node_modules/xml-nodes/index.js:19
this.soFar += String(chunk)

RangeError: Invalid string length
at XmlNodes._transform (/usr/local/lib/node_modules/xml-json/node_modules/xml-nodes/index.js:19:15)
at XmlNodes.Transform._read (_stream_transform.js:183:22)
at XmlNodes.Transform._write (_stream_transform.js:167:12)
at doWrite (_stream_writable.js:265:12)
at writeOrBuffer (_stream_writable.js:252:5)
at XmlNodes.Writable.write (_stream_writable.js:197:11)
at Duplexify._write (/usr/local/lib/node_modules/xml-json/node_modules/pumpify/node_modules/duplexify/index.js:197:22)
at doWrite (/usr/local/lib/node_modules/xml-json/node_modules/pumpify/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:237:10)
at writeOrBuffer (/usr/local/lib/node_modules/xml-json/node_modules/pumpify/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:227:5)
at Writable.write (/usr/local/lib/node_modules/xml-json/node_modules/pumpify/node_modules/duplexify/node_modules/readable-stream/lib/_stream_writable.js:194:11)
at ReadStream.ondata (_stream_readable.js:539:20)
at ReadStream.emit (events.js:107:17)
at readableAddChunk (_stream_readable.js:162:16)
at ReadStream.Readable.push (_stream_readable.js:125:10)
at onread (fs.js:1581:12)
at Object.wrapper [as oncomplete] (fs.js:482:17)
Nicolas Bortolotti
  • "It does not work" is not an error description. And you neither gave a sample of your input, nor a specification of the output you require, nor did you show what exactly you have tried. How do you assume would anyone except a clairvoyant be able to help you? – Tomalak Oct 01 '14 at 06:53
  • Sure, that's true; I can't process the data. I used stackoverflow.com-Comments.7z (from https://archive.org/details/stackexchange), 1.8 GB, but when I try to convert the XML file inside the 7z to JSON, the process never ends. I used a VM with 13 GB RAM and 2 processors. Maybe there is another way to process large files, or to convert them, so that the data can be put into BigQuery. – Nicolas Bortolotti Oct 01 '14 at 14:22
  • Look. The specs of your machine are unimportant for the task. Get a text editor that can open large text files. Open the XML and cut out a representative sample. From that sample, carefully create the JSON you want to see. Post both code samples here and I (or somebody else, for that matter) will be able to see what they can do. You won't get an answer like: *"Just use tool XYZ to convert the XML to JSON."*, partly because you did not post any hard requirements whatsoever, and partly because this magic tool might not even exist. – Tomalak Oct 01 '14 at 14:37
  • Ok, this is an example of the file (XML to convert). It is simple, but when I try to convert the 5 GB file the tool crashes; I don't actually get an error message, it just crashes. If I use a 100 MB file it works, but... I used xml2json: xml2json -t json2xml -o xxx.xml yyy.json *** ... – Nicolas Bortolotti Oct 01 '14 at 15:14
  • It's best if you update your question directly (just click "edit"); comments are no place to post code. – Tomalak Oct 01 '14 at 15:18

2 Answers

2

David M Smith's answer is correct; converting to CSV would also work.

After several attempts (and edits to my answer, as I did not test thoroughly before answering), I managed to create a correct JSON file with a small Python script like this:

#!/usr/bin/env python
from __future__ import print_function
import sys
import fileinput
import xml
from xml.dom import minidom
import json

# Each Stack Exchange dump file has one <row .../> element per line, so every
# line can be parsed on its own and emitted as one JSON object per line
# (newline-delimited JSON, which is what BigQuery expects).
for line in fileinput.input():
        try:
                xmlDoc = minidom.parseString(line)
                print(json.dumps(dict(xmlDoc.childNodes[0].attributes.items())))
        except xml.parsers.expat.ExpatError:
                # Lines that are not a complete element (the XML declaration and
                # the opening/closing wrapper tags) are reported and skipped.
                print("Unable to process line : ", line, file=sys.stderr)
        except KeyboardInterrupt:
                sys.exit(0)
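
As an illustration, here is a minimal sketch that applies the same minidom/json conversion to the Comments row from the question (with the Text attribute shortened); the key order in the output may vary, since the attributes end up in a plain dict:

from xml.dom import minidom
import json

# One <row .../> line from Comments.xml, taken from the question's sample
# (Text shortened here for readability).
line = '<row Id="1" PostId="1" Score="3" Text="We need to all post more questions." CreationDate="2014-02-12T01:01:14.257" UserId="52" />'

doc = minidom.parseString(line)
print(json.dumps(dict(doc.childNodes[0].attributes.items())))
# One newline-delimited JSON record, for example:
# {"UserId": "52", "Text": "We need to all post more questions.", "CreationDate": "2014-02-12T01:01:14.257", "Score": "3", "PostId": "1", "Id": "1"}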

You will then probably need to restart your shell to pick up the updated path (or use any other method).

For the biggest files I needed to split them first, because BigQuery accepts files of at most 4 GB. Here's the complete flow:

7z x -so ../orig/stackoverflow.com-Posts.7z 2> /dev/null | ./xmltojson.py > Posts.json
split -e -d -C3G --additional-suffix=.json Posts.json Postssplit
ls Postssplit*.json | xargs -ifile gzip file 
gsutil cp Postssplit*.json.gz gs://YOURBUCKET
bq --project_id=YOURPROJECT load --source_format=NEWLINE_DELIMITED_JSON YOURDATASET.YOURTABLE gs://YOURBUCKET/Postssplit01.json,gs://YOURBUCKET/Postssplit03.json,gs://YOURBUCKET/Postssplit04.json,#ETCETERA 'Id:INTEGER,PostTypeId:INTEGER,AcceptedAnswerId:INTEGER,ParentId:INTEGER,CreationDate:TIMESTAMP,Score:INTEGER,ViewCount:INTEGER,Body:STRING,OwnerUserId:INTEGER,OwnerDisplayName:STRING,LastEditorUserId:INTEGER,LastEditorDisplayName:STRING,LastEditDate:TIMESTAMP,LastActivityDate:TIMESTAMP,Title:STRING,Tags:STRING,AnswerCount:INTEGER,CommentCount:INTEGER,FavoriteCount:INTEGER,ClosedDate:TIMESTAMP,CommunityOwnedDate:TIMESTAMP'

The gsutil part is not mandatory, but I am more comfortable uploading my files to Cloud Storage first and then importing from there; that way, if the import fails, I can retry.

If someone from the Google team is reading, it would be great to get this as a public dataset :-)

Note that this will not work for arbitrary XML, only for files formatted like the current Stack Exchange data dump exports (one row element per line).
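
For XML that is not laid out one row element per line, a streaming parser would be needed instead. Here is a minimal, untested sketch using ElementTree.iterparse, assuming the same attribute-only row format:

#!/usr/bin/env python
# Streaming variant (sketch): iterparse reads the XML incrementally, so even
# a multi-GB dump never has to be held in memory at once.
from __future__ import print_function
import sys
import json
import xml.etree.ElementTree as ET

context = iter(ET.iterparse(sys.argv[1], events=("start", "end")))
event, root = next(context)  # the wrapper element (<comments>, <users>, ...)
for event, elem in context:
    if event == "end" and elem.tag == "row":
        print(json.dumps(dict(elem.attrib)))
        root.clear()  # drop processed rows so memory stays bounded

It takes the unpacked XML file as its single argument and writes newline-delimited JSON to stdout, so the output can go through the same split/gzip/load steps as above.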

David
  • Thanks David! This procedure worked perfectly. I just added permissions so the xml-json stream could be created: chmod -R 777 folder – Nicolas Bortolotti Oct 02 '14 at 23:25
  • But if I try to use XML files of 100 MB or 200 MB, the tool does not work: xml-json file.xml row > file.json (I split the XML to process it, but no luck yet). – Nicolas Bortolotti Oct 03 '14 at 01:48
  • I just tested it for the Votes.xml file, 6.5GB once unzipped. It does work, but it takes an outrageous amount of time (around 4 hours). – David Oct 03 '14 at 06:12
  • Yes, but 4 hours and a lot of processing... hmm, options? Use a Hadoop cluster to process it and reduce the time? – Nicolas Bortolotti Oct 03 '14 at 14:41
  • If you want to go to that extent, just code your own ad hoc parser with streaming capabilities; it will be faster. – David Oct 03 '14 at 14:42
  • David, I tried with Users.xml from stackoverflow.com-Users.7z but it does not work. In other cases it does, but the tool does not parallelize the work and large files take a very long time to process. Maybe with another tool we could use more CPU power... just an idea. Thanks a lot for your cooperation. – Nicolas Bortolotti Oct 03 '14 at 16:18
  • I haven't tried this myself, but here's one other idea: upload your JSON as a single-column CSV file, where each row is a JSON object (e.g. '{"a": 1, "b": [4, 5]}'.) Then use BQ's JSON_EXTRACT() functions to extract the values you want out to another table. This should work as long as each object (row) is 1MB or less. In other words, let BQ do the ETL heavy lifting for you ;-) https://cloud.google.com/bigquery/query-reference#jsonfunctions – David M Smith Oct 03 '14 at 18:12
  • @NickBortolotti I also had problems with big files so I've updated the solution to use a custom Python script. – David Oct 08 '14 at 05:15
  • Ok David, yes, I think that is the best option. Thanks! – Nicolas Bortolotti Oct 08 '14 at 23:23
1

Do you really need JSON? Unless your data is hierarchical in structure, CSV might be quicker and easier.

Use something like xml2csv-conv to convert your data to CSV, then upload using the bq command line tool:

bq load mydataset.mytable mydata.csv "column1:string,column2:string ..."
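
If xml2csv-conv runs out of heap space on the bigger dumps (as reported in the comments below), a line-by-line conversion in the spirit of the Python script in the other answer could write CSV instead. A minimal, untested sketch for Comments.xml, with the column list taken from the sample row in the question (the script name xmltocsv.py used below is just a placeholder):

#!/usr/bin/env python
# Sketch (Python 2): stream Stack Exchange dump rows to CSV, assuming one
# <row .../> element per input line, as in the current dump format.
import sys
import csv
import fileinput
from xml.dom import minidom

# Columns for Comments.xml, from the sample row in the question; adjust per file.
COLUMNS = ["Id", "PostId", "Score", "Text", "CreationDate", "UserId"]

writer = csv.writer(sys.stdout)
for line in fileinput.input():
    try:
        row = minidom.parseString(line).childNodes[0]
        attrs = dict(row.attributes.items())
    except Exception:
        continue  # skip the XML declaration and the wrapper tags
    # Encode to UTF-8: Python 2's csv module does not accept unicode values.
    writer.writerow([attrs.get(c, u"").encode("utf-8") for c in COLUMNS])

Usage would mirror the JSON flow in the other answer, e.g. 7z x -so stackoverflow.com-Comments.7z | ./xmltocsv.py > Comments.csv, followed by the bq load above; fields that can contain newlines (such as Text or Body) may additionally need BigQuery's allow_quoted_newlines option.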
David M Smith
  • Sure, if it's CSV, great! But the file sizes average 5 GB (2 GB, 4 GB, 6 GB), so... does this tool support large files? I think not, but I'll try right now. – Nicolas Bortolotti Oct 01 '14 at 15:00
  • Gzipping the file before you upload will help. Depending on the gzip file size you may also need to upload to Google cloud storage first, then use bq to import into your table. https://cloud.google.com/bigquery/loading-data-into-bigquery#loaddatagcs – David M Smith Oct 02 '14 at 04:18
  • ...but when I try to process an 800 MB file with xml2csv-conv -l row file.xml file.csv I get: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.createChunk(DeferredDocumentImpl.java:1932) – Nicolas Bortolotti Oct 03 '14 at 00:37
  • http://stackoverflow.com/questions/37335/how-to-deal-with-java-lang-outofmemoryerror-java-heap-space-error-64mb-heap – David M Smith Oct 03 '14 at 17:56