11

I hope the subject is clear enough; I haven't found anything specifically about this among the previously asked questions. I've tried implementing this in Perl and Python, but I think I may be trying too hard.

Is there a simple shell command / pipeline that will split my 4 MB .txt file into separate .txt files, based on a beginning and ending regex?

I provide a short sample of the file below, so you can see that every "story" starts with the phrase "X of XXX DOCUMENTS", which could be used to split the file.

I think this should be easy, and I'd be surprised if bash couldn't do it faster than Perl/Python.

Here it is:

                           1 of 999 DOCUMENTS


              Copyright 2011 Virginian-Pilot Companies LLC
                          All Rights Reserved
                   The Virginian-Pilot(Norfolk, VA.)

...



                           3 of 999 DOCUMENTS


                  Copyright 2011 Canwest News Service
                          All Rights Reserved
                          Canwest News Service

...

Thanks in advance for all your help.

Ross

peterh

5 Answers

22
awk '/[0-9]+ of [0-9]+ DOCUMENTS/{g++} { print $0 > (g ".txt") }' file

OSX users will need gawk if the parentheses around the output filename are omitted, as the builtin awk will then produce an error like awk: illegal statement at source line 1

Ruby(1.9+)

#!/usr/bin/env ruby
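g = 0
f = File.open("#{g}.txt", "w")   # 0.txt collects anything before the first header
File.foreach("file") do |line|
  if line =~ /\d+ of \d+ DOCUMENTS/
    f.close
    g += 1                       # document N ends up in N.txt, as in the awk version
    f = File.open("#{g}.txt", "w")
  end
  f.print line
end
f.close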
Ian
kurumi
  • Oh, and we have a winner: speed *AND* elegance. I spent a really wet summer in 1997 with the O'Reilly sed/awk book. Wish I could recall all that now. I *will* go and get it tomorrow. **THANK YOU** –  Feb 10 '11 at 01:32
  • This solution puts the matching line in the new file, which answers the question. But if, like me, you want to put the matching line in the old file before starting the new one, you'd do this: `awk '{print $0 > n".txt"} /text to match/ {n++}'` – indiv Mar 14 '14 at 01:03
  • Note: on Mac OS X you need `gawk` from e.g. MacPorts for this to work – Thomas Wana Apr 28 '16 at 15:42
10

As suggested in other solutions, you could use csplit for that:

csplit csplit.test '/^\.\.\./' '{*}' && sed -i '/^\.\.\./d' xx*

I haven't found a better way to get rid of the leftover separator in the split files.

raphink
  • I can't try it right now because I'm on Windows, but csplit's man page seems to suggest using %REGEXP% instead of /REGEXP/ for that: `/REGEXP/[OFFSET]` means "copy up to but not including a matching line", while `%REGEXP%[OFFSET]` means "skip to, but not including a matching line" – Spikolynn Jul 07 '15 at 17:51
1

How hard did you try in Perl?

Edit: Here is a faster method. It reads the whole file into memory, splits it at each header line, then writes out the part files.

use strict;
use warnings;

my $count = 1;

open (my $file, '<', 'source.txt') or die "Can't open source.txt: $!";

for (split /(?=^.*\d+[^\S\n]*of[^\S\n]*\d+[^\S\n]*DOCUMENTS)/m, join('',<$file>))
{
    if ( s/^.*?(\d+)\s*of\s*\d+\s*DOCUMENTS.*(\n|$)//m )   # non-greedy .*? so $1 holds the whole document number
    {
        open (my $part, '>', "Part$1_$count.txt") 
            or die "Can't open Part$1_$count for output: $!";
        print $part $_;
        close ($part);
        $count++;
    }
}
close ($file);

This is the line by line method:

use strict;
use warnings;

open (my $masterfile, '<', 'yourfilename.txt') or die "Can't open yourfilename.txt: $!";

my $count = 1;
my $fh;

while (<$masterfile>) {
    if ( /(?<!\d)(\d+)\s*of\s*\d+\s*DOCUMENTS/ ) {
        defined $fh and close ($fh);
        open ($fh, '>', "Part$1_$count.txt") or die "Can't open Part$1_$count for output: $!";
        $count++;
        next;
    }
    defined $fh and print $fh $_;
}
defined $fh and close ($fh);
close ($masterfile);
  • `$count` is undefined. I suspect you meant `$cnt`. Also, the first time you run through the loop `$fh` is undefined, so you'll get a `Can't use an undefined value as a symbol reference` error/warning when you try to close `$fh`. – CanSpice Feb 10 '11 at 00:42
  • Way over my head in Perl, not that I'm not trying... Perl, Python, R, bits of Ruby, bash, a little C++. As well as being a jobbing doc and trying to do some research... ta for the help. –  Feb 10 '11 at 00:59
  • Might as well put a check on the final close() too –  Feb 10 '11 at 01:00
  • @rosser - oh, it's not that bad in Perl. A shaved-down version could be done from the command line, a so-called one-liner. –  Feb 10 '11 at 01:05
  • Can't use an undefined value as a symbol reference at getfile.pl line 16, <$masterfile> line 1. –  Feb 10 '11 at 01:11
  • @rosser - Good catch! You're right. Do you know how to fix it? `defined $fh and print $fh $_;` It was just an untested example; it's fixed now. I would probably write it differently for my own use. –  Feb 10 '11 at 15:04
0

A regex to match "X of XXX DOCUMENTS" is
\d{1,3} of \d{1,3} DOCUMENTS

Reading the file line by line and starting a new output file on each regex match should be fine.
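
For what it's worth, the line-by-line approach might look something like the Python sketch below. It is only an illustration under a few assumptions: the function name `split_documents`, the `doc0001.txt` naming scheme and the whitespace handling around the header are made up for the example, and anything before the first header is simply dropped.

```python
import re
from pathlib import Path

# Header regex from the answer; the ^/$ anchors and surrounding
# whitespace are an assumption based on the sample in the question.
HEADER = re.compile(r"^\s*\d{1,3} of \d{1,3} DOCUMENTS\s*$")


def split_documents(source, out_dir, prefix="doc"):
    """Write each story to its own file and return the paths created."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    with open(source, encoding="utf-8") as src:
        for line in src:
            if HEADER.match(line):
                # A header starts a new story: close the previous file.
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}{len(paths) + 1:04d}.txt"
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
            # Lines before the first header have no file yet and are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

Calling `split_documents("file", "parts")` would then leave one file per story in a `parts` directory. Only one output file is open at a time, so the number of stories should not run into open-file limits.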

bw_üezi
-1

Untested:

base=outputfile
count=0
pattern='^[[:blank:]]*[[:digit:]]+ of [[:digit:]]+ DOCUMENTS[[:blank:]]*$'

while IFS= read -r line
do
    if [[ $line =~ $pattern ]]
    then
        count=$((count + 1))
        printf -v filecount '%04d' "$count"
        : > "$base$filecount"    # create an empty file named like outputfile0001
    fi
    # lines before the first header have no output file yet, so skip them
    [[ -n $filecount ]] && printf '%s\n' "$line" >> "$base$filecount"
done < file
Dennis Williamson