I've done a bit of parsing of data out of Wikipedia. I'm particularly interested in extracting the equations, so I'm only interested in part of the file.
Firstly, if it's WikiMedia data you're interested in, it's much easier to get a Labs account. It takes about a day to set up and lets you run much of the code on their machines, avoiding the need to download multiple gigabytes. With a Labs account you should be able to run code against a fairly up-to-date replica of the database, avoiding the need for the JSON dumps entirely.
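For example, querying the replica from a Labs machine looks something like this (I'm using pymysql here; the host and database names and the credentials file are assumptions about the usual setup, so check the current Labs docs):

import os
import pymysql

# The host and database names (enwiki.labsdb, enwiki_p) and the
# replica.my.cnf credentials file are assumptions about the usual
# Labs setup; check the current documentation.
conn = pymysql.connect(host='enwiki.labsdb',
                       db='enwiki_p',
                       read_default_file=os.path.expanduser('~/replica.my.cnf'))
cur = conn.cursor()
# Count main-namespace articles as a quick sanity check.
cur.execute("SELECT COUNT(*) FROM page WHERE page_namespace = 0")
print cur.fetchone()[0]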
I use a simple Python program to parse the data. It basically runs a few regexps over each line: one to find lines containing <title>...</title>, so I know which Wikipedia article it is, and a few more to find the namespace and the maths tags. It can process a 160MB file in 13 seconds, so it might be able to get through the whole 36GB in under an hour.
This code produces text files with only the data I'm interested in. If you're interested, the code is:
import sys
import re

# With -d, also dump each <title> line as it is seen.
dump = len(sys.argv) > 1 and sys.argv[1] == '-d'

titleRE = re.compile('<title>(.*)</title>')
nsRE = re.compile('<ns>(.*)</ns>')
# The maths tags inside the page text are entity-escaped in the dump.
mathRE = re.compile('&lt;/?math(.*?)&gt;')
pageEndRE = re.compile('</page>')  # not actually used below

# Leftover counters, not used below.
supOc = 0
supCc = 0
subOc = 0
subCc = 0

title = ""
attr = ""
ns = -1
inEqn = 0
expression = ""
start = 0

for line in sys.stdin:
    m = titleRE.search(line)
    if m:
        title = m.group(1)
        expression = ""
        if dump:
            print line
        inEqn = 0
    m = nsRE.search(line)
    if m:
        ns = m.group(1)  # namespace is captured but not filtered on here
    start = 0
    pos = 0
    m = mathRE.search(line, pos)
    while m:
        if m.group().startswith('&lt;math'):
            # Opening tag: remember any attributes and where the equation starts.
            attr = m.group(1)
            start = m.end()
            pos = start
            expression = ""
            inEqn = 1
        if m.group() == '&lt;/math&gt;':
            # Closing tag: emit title, attributes and the unescaped equation.
            end = m.start()
            expression = ' '.join([expression, line[start:end]])
            print title, '\t', attr, '\t', expression.lstrip().replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&')
            pos = m.end()
            expression = ""
            start = 0
            inEqn = 0
        m = mathRE.search(line, pos)
    # Equation continues past the end of the line: carry it over.
    if start > 0:
        expression = line[start:].rstrip()
    elif inEqn:
        expression = ' '.join([expression, line.rstrip()])
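The script reads the dump on standard input, so you can run it straight off a compressed dump with something like

bzcat enwiki-latest-pages-articles.xml.bz2 | python wikimath.py > math.tsv

(the script and output names are just illustrative).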
Sorry if it's a bit cryptic, but it was not meant for public consumption. Sample output is
Arithmetic mean a_1,\ldots,a_n.
Arithmetic mean A
Arithmetic mean A=\frac{1}{n}\sum_{i=1}^{n} a_i
Arithmetic mean \bar{x}
Each line has the name of the article and the LaTeX equation. This reduces the data I need to work with down to a more manageable 500k. I'm not sure if such a strategy would work for your application.
For the main enwiki data, they split the XML dumps into 27 smaller files of roughly equal size. You might find a few reasonably sized files easier to work with than either one giant file or millions of tiny files. It might be easy to split by the first letter of the article title, giving fewer than a hundred files, each less than a gigabyte; a sketch of that idea is below.
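This is just a sketch, assuming one XML tag per line as in the standard dumps; it drops the header before the first <page>, so the output files are page fragments rather than complete XML documents, and the file names are illustrative.

import sys
import re

titleRE = re.compile('<title>(.*)</title>')

handles = {}   # one output file per first letter
pending = []   # lines of the current page seen before its <title>
out = None

for line in sys.stdin:
    if '<page>' in line:
        pending = [line]
        out = None
    elif out is None and pending:
        pending.append(line)
        m = titleRE.search(line)
        if m:
            first = m.group(1)[:1].upper()
            letter = first if first.isalpha() else '0'  # bucket non-letters together
            if letter not in handles:
                handles[letter] = open('enwiki-' + letter + '.xml', 'w')
            out = handles[letter]
            out.writelines(pending)
            pending = []
    elif out is not None:
        out.write(line)
        if '</page>' in line:
            out = None

for f in handles.values():
    f.close()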