
I have an XML file that's about 1 GB in size, with

grep -c "</record>')," file
238613 

I'd like to split it into chunks of 1000 records, but each file needs to end with

</record>'),   

I would then end up with 239 files (238 full chunks plus one final partial chunk).

Here is the actual file with the first two records:

\set bib_tag '''IMPORT CONCERTO'''
INSERT INTO marcxml_import (tag, marc) VALUES
(:bib_tag,'<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<datafield and subfield data>
</record>'),
(:bib_tag,'<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<datafield and subfield data>
</record>'),
4 Answers


You should use a language or tool that supports XML parsing. You could choose one from the following list:

Perl, Python, Ruby, PHP-cli (with SimpleXMLElement and XPath, for example), xmllint, etc.

You should avoid using regular expressions for this task.

Here is an example of a PHP shell script using XPath queries: https://stackoverflow.com/a/20940216/2900196
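For instance, here is a minimal Python sketch using xml.etree.ElementTree.iterparse. It assumes the SQL wrapper lines have been stripped and the records wrapped in a single root element; the records.xml name, the chunk_ file prefix, and the 1000-record chunk size are illustrative:

import xml.etree.ElementTree as ET

CHUNK_SIZE = 1000

def flush(chunk, file_no):
    # write one chunk of serialized <record> elements to its own file
    with open("chunk_%04d.xml" % file_no, "w") as out:
        out.write("\n".join(chunk))

chunk, file_no = [], 0
# iterparse streams the input, so the ~1 GB file never sits in memory at once
for event, elem in ET.iterparse("records.xml", events=("end",)):
    if elem.tag == "record":
        chunk.append(ET.tostring(elem, encoding="unicode"))
        elem.clear()  # drop the element's children to keep memory flat
        if len(chunk) == CHUNK_SIZE:
            file_no += 1
            flush(chunk, file_no)
            chunk = []
if chunk:  # the final partial chunk (records 238001-238613 here)
    flush(chunk, file_no + 1)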


You could write a small XSLT script to split the file.

Using a template, a for-each loop and a result-document should be sufficient.
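A minimal sketch of such a stylesheet, here using for-each-group to form the 1000-record chunks. It assumes the SQL wrapper has been stripped and the records are wrapped in a single root element (called records below); xsl:for-each-group and xsl:result-document require an XSLT 2.0 processor such as Saxon:

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/records">
    <!-- put every 1000 adjacent records into their own output document -->
    <xsl:for-each-group select="record"
                        group-adjacent="(position() - 1) idiv 1000">
      <xsl:result-document href="chunk_{current-grouping-key() + 1}.xml">
        <records>
          <xsl:copy-of select="current-group()"/>
        </records>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>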


Using GNU awk:

awk '{print $0 RS > (NR ".xml")}' RS="</record>')," file

After running it, you get one XML file per record (a few hundred thousand of them here). For example:

cat 1.xml

\set bib_tag '''IMPORT CONCERTO'''
INSERT INTO marcxml_import (tag, marc) VALUES
(:bib_tag,'<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<datafield and subfield data>
</record>'),
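To get the 1000-record chunks the question asks for, a variant of the same idea can group records before writing. This is a sketch: RS as a regex and the RT variable are gawk extensions, and the chunk_ file names are illustrative:

gawk '
  # \047 is a single quote; the \\ escapes the ) so no gawk version
  # mistakes it for a regex metacharacter
  BEGIN { RS = "</record>\047\\)," }
  RT != "" {                     # text after the last separator is skipped
    if ((NR - 1) % 1000 == 0) {  # start a new chunk every 1000 records
      if (out) close(out)        # close finished chunks to save file descriptors
      out = sprintf("chunk_%03d.sql", ++c)
    }
    printf "%s%s", $0, RT > out  # RT is the separator text that was matched
  }' file

Note that anything after the last "</record>')," (for example a final record terminated with "');" instead) is not written; handle that tail separately if you need it.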

At least on Mac you can split files simply with the split command:

split -p "</record>')," file bib_snippet_

The -p flag splits the file at each line that matches the given pattern (an extended regular expression); the matching line becomes the first line of the next output file.

Update: since you need each file to end with "</record>'),", you have to append it manually with this approach:

for f in bib_snippet_*; do echo "</record>')," >> "$f"; done
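One caveat: with roughly 238,000 matches, split's default two-character suffixes allow only 676 output files, so you will likely also need to raise the suffix length with -a (the value 4 below is just an example):

split -a 4 -p "</record>')," file bib_snippet_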