I have a very large XML file (1.25 GB) that I need to split into smaller files so that I can process it. The file contains linguistic data in which each text is opened and closed by the tags:

<text id="www.example.com">

and

</text>

I would like to split the large file on these tags so that, for example,

<text id="www.example.com">

Hello

</text>

<text id="www.example.com">

This is

</text>

<text id="www.example.com">

An Example

</text>

would become three separate files, each beginning and ending with the "text" tags. For example:

File 1

<text id="www.example.com">

Hello

</text>

File 2

<text id="www.example.com">

This is

</text>

File 3

<text id="www.example.com">

An Example

</text>

I suppose this could be done with a Perl script, for instance, but I'm wondering whether there is any kind of "one-stop shop" way to split this file using Unix tools.

I know that the split command can break a large file into smaller files by line count or file size. Is there a similar command that splits by XML tag?

Thanks in advance for any help!

owwoow14

3 Answers

The following Perl program, found in the question "Split one file into multiple files based on delimiter":

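#!/usr/bin/perl
use strict;
use warnings;

# Read line by line so the whole input never has to fit in memory.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
my $cur = 0;
my $out;
while (my $line = <$in>) {
    # Open the next output file lazily so no empty file is left at the end.
    if (!defined $out) {
        open($out, '>', "res.$cur.txt") or die "Cannot open res.$cur.txt: $!";
    }
    print {$out} $line;
    if ($line =~ /^<\/text>/) {   # closing tag (note the escaped slash) ends the chunk
        close($out) or die "Cannot close res.$cur.txt: $!";
        undef $out;
        $cur++;
    }
}
close($out) if defined $out;
close($in);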

It also seems to do the trick, with no cap on the number of output files.

Cheers.

owwoow14

The following awk command solves the problem, but unfortunately it caps out at around 1,000 output files:

awk '{print $0 ""> "file" NR}' RS='' input-file
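The cap most likely comes from awk keeping every output file open until it is explicitly closed, so the process eventually runs out of file descriptors. Below is a minimal sketch of a variant that closes each file as soon as its closing tag has been written. It assumes the opening and closing tags each start at the beginning of a line; the function name split_texts, the sample file sample.xml, and the output prefix part are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split_texts INPUT PREFIX
# Writes each <text ...> ... </text> block to PREFIX.NNNN.xml.
split_texts() {
  awk -v prefix="$2" '
    # An opening tag starts a new output file (only if none is open).
    /^<text[ >]/ && out == "" { n++; out = sprintf("%s.%04d.xml", prefix, n) }
    # While a file is open, copy the current line into it.
    out != "" { print > out }
    # A closing tag ends the block: close the file to free its descriptor.
    /^<\/text>/ && out != "" { close(out); out = "" }
  ' "$1"
}

# Small sample input, only for trying the function out.
cat > sample.xml <<'EOF'
<text id="www.example.com">
Hello
</text>
<text id="www.example.com">
This is
</text>
<text id="www.example.com">
An Example
</text>
EOF

split_texts sample.xml part
```

Run on the sample, this should leave part.0001.xml through part.0003.xml, each holding one block. Lines that fall outside any block, such as blank separator lines, are skipped rather than copied.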
owwoow14

It's a lot more complicated than a simple awk command, and I don't know whether the file would be too big, but you could try using an XSLT 2.0 style sheet with result-document to produce all of your files.

One advantage of using XSLT over a regex is that it copes better if the file format changes slightly or if there are attributes on the nodes you want to split on.

Stanley De Boer
  • Thanks for the tip. I will definitely check out the XSLT 2.0 style sheet. Also, just as a point of reference, I agree with you about the awk (the exact error I was getting is: awk: cannot open "F1021" for output (Too many open files)). – owwoow14 Mar 19 '13 at 17:20