Split file by XML tag

Question

I have a very large XML file (1.25 GB) that I need to split into smaller files so that I can process it. The file contains linguistic data in blocks, each opened and closed by the tags:

< text id="www.example.com>

and

< /text>

I would like to split the large file at these tags, so that, for example,

< text id="www.example.com>

Hello

< /text>

< text id="www.example.com>

This is

< /text>

< text id="www.example.com>

An Example

< /text>

would become three separate files, each beginning and ending with the "text" tags. For example:

File 1

< text id="www.example.com>

Hello

< /text>

File 2

< text id="www.example.com>

This is

< /text>

File 3

< text id="www.example.com>

An Example

< /text>

I suppose this could be done by scripting in Perl, for instance, but I'm wondering if there's any kind of "one stop shop" way to split this file using Unix tools.

I know that the split command can break a large file into smaller files by line count or file size. Is there a similar command that splits by XML tag?

Thanks in advance for any help!

score 2 · Accepted Answer · edited May 23 '17 at 12:21

The following Perl program was found here: Split one file into multiple files based on delimiter

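#!/usr/bin/perl
use strict;
use warnings;

# Copy file.txt into res.0.txt, res.1.txt, ..., starting a new file after each </text> line.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
my $cur = 0;
my $out;
while (my $line = <$in>) {
    # Open the next output file lazily, so no empty trailing file is
    # created when the input ends with </text>.
    if (!defined $out) {
        open($out, '>', "res.$cur.txt") or die "Cannot open res.$cur.txt: $!";
    }
    print {$out} $line;
    if ($line =~ /^<\/text>/) {    # the backslash escapes the slash in the closing tag
        close($out);
        undef $out;
        $cur++;
    }
}
close($out) if defined $out;
close($in);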

It also seems to do the trick, with no cap on the number of output files.

Cheers.

score 1 · Answer 2 · answered Mar 19 '13 at 17:04

The following awk command solves the problem, but unfortunately it fails after around 1,000 output files:

awk '{print $0 ""> "file" NR}' RS='' input-file
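
A possible way around the cap, offered as a sketch rather than a tested fix for the original data: awk keeps each output file open until it is closed explicitly or the program exits, so calling close() once a block has been written should keep the number of open files at one. Note that this variant swaps the blank-line record separator (RS='') for matching on the tags themselves, and it assumes every opening and closing "text" tag starts its own line. The function name split_by_text_tag, the default "part." prefix, and the ".xml" suffix are hypothetical choices for illustration.

```shell
# split_by_text_tag INPUT [PREFIX]
# Hypothetical helper: writes PREFIX1.xml, PREFIX2.xml, ... (default prefix "part.").
# Assumes each <text ...> and </text> tag begins its own line.
split_by_text_tag() {
    awk -v prefix="${2:-part.}" '
        /^<text[ >]/ { n++; out = prefix n ".xml" }  # opening tag: choose the next output file
        out != ""    { print > out }                 # copy lines while inside a <text> block
        /^<\/text>/  { close(out); out = "" }        # closing tag: release the file handle
    ' "$1"
}
```

Called as `split_by_text_tag input-file`, this would write part.1.xml, part.2.xml, and so on into the current directory. Lines that fall outside any "text" block, such as blank lines between blocks, are skipped rather than written to a file.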

owwoow14

score 1 · Answer 3 · answered Mar 19 '13 at 17:09

It's a lot more complicated than a simple awk command, and I don't know whether the file would be too big, but you could try using an XSLT 2.0 stylesheet with result-document to produce all of your files.

One advantage of using XSLT over a regex is that it copes better if the file format changes slightly or if there are attributes on the nodes you want to split on.

Stanley De Boer

Thanks for the tip. I will definitely check out the XSLT 2.0 stylesheet. Also, just as a point of reference, I agree with you about the awk. The exact error I was getting is: awk: cannot open "F1021" for output (Too many open files) – owwoow14 Mar 19 '13 at 17:20
