
I have an XML file that's about 1 GB in size, with

grep -c "</record>')," file
238613 

I'd like to split it into chunks of 1000 records, but each file needs to end with

</record>'),   

I would then end up with 239 files (238 full chunks plus one final partial chunk).

Here is the actual file with the first two records:

\set bib_tag '''IMPORT CONCERTO'''
INSERT INTO marcxml_import (tag, marc) VALUES
(:bib_tag,'<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<datafield and subfield data>
</record>'),
(:bib_tag,'<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<datafield and subfield data>
</record>'),
4 Answers


You should use a language or tool that supports XML parsing. You could choose one from the following list:

Perl, Python, Ruby, PHP-cli (with SimpleXMLElement and XPath, for example), xmllint, etc.

You should avoid using regular expressions for this task.

Here is an example of a PHP shell script using XPath queries: https://stackoverflow.com/a/20940216/2900196
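For instance, here is a minimal Python sketch using xml.etree.ElementTree.iterparse. It assumes the SQL wrapper lines have been stripped and the records wrapped in a single root element; the records.xml name, the chunk_ file prefix, and the 1000-record chunk size are illustrative:

import xml.etree.ElementTree as ET

CHUNK_SIZE = 1000

def flush(chunk, file_no):
    # write one chunk of serialized <record> elements to its own file
    with open("chunk_%04d.xml" % file_no, "w") as out:
        out.write("\n".join(chunk))

chunk, file_no = [], 0
# iterparse streams the input, so the ~1 GB file never sits in memory at once
for event, elem in ET.iterparse("records.xml", events=("end",)):
    if elem.tag == "record":
        chunk.append(ET.tostring(elem, encoding="unicode"))
        elem.clear()  # drop the element's children to keep memory flat
        if len(chunk) == CHUNK_SIZE:
            file_no += 1
            flush(chunk, file_no)
            chunk = []
if chunk:  # the final partial chunk (records 238001-238613 here)
    flush(chunk, file_no + 1)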


You could write a small XSLT script to split the file.

Using a template, a for-each loop and a result-document should be sufficient.
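A minimal sketch of such a stylesheet, here using for-each-group to form the 1000-record chunks. It assumes the SQL wrapper has been stripped and the records are wrapped in a single root element (called records below); xsl:for-each-group and xsl:result-document require an XSLT 2.0 processor such as Saxon:

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/records">
    <!-- put every 1000 adjacent records into their own output document -->
    <xsl:for-each-group select="record"
                        group-adjacent="(position() - 1) idiv 1000">
      <xsl:result-document href="chunk_{current-grouping-key() + 1}.xml">
        <records>
          <xsl:copy-of select="current-group()"/>
        </records>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>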


Using GNU awk:

awk '{print $0 RS > (NR ".xml")}' RS="</record>')," file

After running it, you get one XML file per record (a few hundred thousand of them here). For example:

cat 1.xml

\set bib_tag '''IMPORT CONCERTO'''
INSERT INTO marcxml_import (tag, marc) VALUES
(:bib_tag,'<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<datafield and subfield data>
</record>'),
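To get the 1000-record chunks the question asks for, a variant of the same idea can group records before writing. This is a sketch: RS as a regex and the RT variable are gawk extensions, and the chunk_ file names are illustrative:

gawk '
  # \047 is a single quote; the \\ escapes the ) so no gawk version
  # mistakes it for a regex metacharacter
  BEGIN { RS = "</record>\047\\)," }
  RT != "" {                     # text after the last separator is skipped
    if ((NR - 1) % 1000 == 0) {  # start a new chunk every 1000 records
      if (out) close(out)        # close finished chunks to save file descriptors
      out = sprintf("chunk_%03d.sql", ++c)
    }
    printf "%s%s", $0, RT > out  # RT is the separator text that was matched
  }' file

Note that anything after the last "</record>')," (for example a final record terminated with "');" instead) is not written; handle that tail separately if you need it.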

At least on Mac you can split files simply with the split command:

split -p "</record>')," file bib_snippet_

The -p flag splits the file at each line that matches the given pattern (an extended regular expression); the matching line becomes the first line of the next output file.

Update: since you need each file to end with "</record>'),", you have to append it manually with this approach:

for f in bib_snippet_*; do echo "</record>')," >> "$f"; done
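One caveat: with roughly 238,000 matches, split's default two-character suffixes allow only 676 output files, so you will likely also need to raise the suffix length with -a (the value 4 below is just an example):

split -a 4 -p "</record>')," file bib_snippet_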