Splitting a large txt file into 200 smaller txt files on a regex using shell script in BASH

Question

I hope the subject is clear enough; I haven't found anything specifically about this among the previously asked questions. I've tried implementing this in Perl or Python, but I think I may be trying too hard.

Is there a simple shell command or pipeline that will split my 4 MB .txt file into separate .txt files, based on a beginning and ending regex?

I provide a short sample of the file below, so you can see that every "story" starts with the phrase "X of XXX DOCUMENTS", which could be used to split the file.

I think this should be easy, and I'd be surprised if bash can't do it faster than Perl/Python.

Here it is:

                           1 of 999 DOCUMENTS


              Copyright 2011 Virginian-Pilot Companies LLC
                          All Rights Reserved
                   The Virginian-Pilot(Norfolk, VA.)

...



                           3 of 999 DOCUMENTS


                  Copyright 2011 Canwest News Service
                          All Rights Reserved
                          Canwest News Service

...

Thanks in advance for all your help.

Ross

Please edit and remove about 95% of the text in your question. — Dennis Williamson, Feb 10 '11 at 00:30
possible duplicate of [Split one file into multiple files based on delimiter](http://stackoverflow.com/questions/11313852/split-one-file-into-multiple-files-based-on-delimiter) — tripleee, Jun 28 '13 at 04:49

score 22 · Accepted Answer · edited Jul 07 '17 at 11:00

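awk '/[0-9]+ of [0-9]+ DOCUMENTS/{g++} {print $0 > (g ".txt")}' file

The parentheses around the file name expression matter on OS X: without them (print $0 > g".txt") the builtin awk produces an error like awk: illegal statement at source line 1. Parenthesize as above, or use gawk.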

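Ruby (1.9+)

#!/usr/bin/env ruby
# 1.txt receives anything before the first header; each header line starts the next file.
g = 1
f = File.open("#{g}.txt", "w")
File.foreach("file") do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    g += 1
    f = File.open("#{g}.txt", "w")
  end
  f.print line
end
f.close  # flush and close the last part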

edited Jul 07 '17 at 11:00 by Ian
answered Feb 10 '11 at 01:19 by kurumi

Oh, and we have a winner... speed *AND* elegance. I spent a really wet summer in 1997 with the O'Reilly sed/awk book. Wish I could recall all that now. I *will* go and get it tomorrow. **THANK YOU** – Feb 10 '11 at 01:32
This solution puts the matching line in the new file, which answers the question. But if, like me, you want to put the matching line in the old file before starting the new one, you'd do this: `awk '{print $0 > (n ".txt")} /text to match/ {n++}' file` – indiv Mar 14 '14 at 01:03
Note: on Mac OS X you need `gawk` from e.g. MacPorts for this to work – Thomas Wana Apr 28 '16 at 15:42

score 10 · Answer 2 · answered Feb 10 '11 at 15:57

As suggested in other solutions, you could use csplit for that:

csplit csplit.test '/^\.\.\./' '{*}' && sed -i '/^\.\.\./d' xx*

I haven't found a better way to get rid of the leftover separator in the split files.
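For comparison, here is a minimal Python sketch of the same idea that drops the separator while splitting, so no second pass is needed. It is not a csplit feature; the function name `split_dropping_separator` and its parameters are made up for illustration, the output names merely imitate csplit's `xx00`, `xx01`, ... convention, and the whole input is held in memory, which is assumed to be acceptable for a file of a few megabytes.

```python
import re
from pathlib import Path


def split_dropping_separator(source, out_dir, separator=r"^\.\.\."):
    """Split source at lines matching separator; separator lines are not written.

    Hypothetical helper: returns the list of files it wrote, named xx00, xx01, ...
    """
    sep = re.compile(separator)
    chunks = [[]]  # chunk 0 holds everything before the first separator
    with Path(source).open(encoding="utf-8") as src:
        for line in src:
            if sep.search(line):
                chunks.append([])  # start a new chunk and drop the separator line
            else:
                chunks[-1].append(line)

    written = []
    for i, chunk in enumerate(chunks):
        path = Path(out_dir) / f"xx{i:02d}"
        path.write_text("".join(chunk), encoding="utf-8")
        written.append(path)
    return written
```

Calling `split_dropping_separator("csplit.test", ".")` would write the pieces into the current directory. Like csplit without `-z`, this sketch also writes empty pieces, for example when two separator lines follow each other.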

answered Feb 10 '11 at 15:57 by raphink

I can't try right now because I'm on Windows, but csplit's man page seems to suggest using %REGEX% instead of /REGEX/ for that: `/REGEXP/[OFFSET]` copy up to but not including a matching line; `%REGEXP%[OFFSET]` skip to, but not including a matching line – Spikolynn Jul 07 '15 at 17:51

score 1 · Answer 3 · edited Feb 10 '11 at 16:29

How hard did you try in Perl?

Edit: Here is a faster method. It reads the whole file into memory, splits it, then writes the part files.

use strict;
use warnings;

my $count = 1;

open (my $file, '<', 'source.txt') or die "Can't open source.txt: $!";

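# Slurp the file and split it just before each "N of M DOCUMENTS" line.
for (split /(?=^.*\d+[^\S\n]*of[^\S\n]*\d+[^\S\n]*DOCUMENTS)/m, join('',<$file>))
{
    # Strip the header line; the non-greedy .*? lets $1 capture the whole document number.
    if ( s/^.*?(\d+)\s*of\s*\d+\s*DOCUMENTS.*(\n|$)//m )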
    {
        open (my $part, '>', "Part$1_$count.txt") 
            or die "Can't open Part$1_$count for output: $!";
        print $part $_;
        close ($part);
        $count++;
    }
}
close ($file);

This is the line by line method:

use strict;
use warnings;

open (my $masterfile, '<', 'yourfilename.txt') or die "Can't open yourfilename.txt: $!";

my $count = 1;
my $fh;

while (<$masterfile>) {
    if ( /(?<!\d)(\d+)\s*of\s*\d+\s*DOCUMENTS/ ) {
        defined $fh and close ($fh);
        open ($fh, '>', "Part$1_$count.txt") or die "Can't open Part$1_$count for output: $!";
        $count++;
        next;
    }
    defined $fh and print $fh $_;
}
defined $fh and close ($fh);
close ($masterfile);

edited Feb 10 '11 at 16:29

answered Feb 10 '11 at 00:38

`$count` is undefined. I suspect you meant `$cnt`. Also, the first time you run through the loop `$fh` is undefined, so you'll get a `Can't use an undefined value as a symbol reference` error/warning when you try to close `$fh`. – CanSpice Feb 10 '11 at 00:42
Way over my head in Perl - not that I'm not trying... Perl, Python, R, bits of Ruby, bash, a little C++. As well as being a jobbing Doc and trying to do some research... ta for the help. – Feb 10 '11 at 00:59
Might as well put a check on the final close() too – Feb 10 '11 at 01:00
@rosser - oh, it's not that bad in Perl. A shaved-down version could be done from the command line, a so-called one-liner. – Feb 10 '11 at 01:05
Can't use an undefined value as a symbol reference at getfile.pl line 16, <$masterfile> line 1. – Feb 10 '11 at 01:11
@rosser - Good catch! You're right. Do you know how to fix it? `defined $fh and print $fh $_;` It was just an untested example; it's fixed now. I would probably write it differently for my own use. – Feb 10 '11 at 15:04

score 0 · Answer 4 · answered Feb 10 '11 at 00:34

A regex to match "X of XXX DOCUMENTS" is:
\d{1,3} of \d{1,3} DOCUMENTS

Reading the file line by line and starting a new output file on each regex match should be fine.
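The line-by-line approach described above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name `split_documents` is hypothetical, the header is assumed to sit alone on its line (surrounded by optional whitespace, as in the question's sample), the pattern uses `\d+` rather than a fixed digit count, and any lines before the first header are skipped rather than written anywhere.

```python
import re
from pathlib import Path

# A header such as "   1 of 999 DOCUMENTS", alone on its line.
HEADER = re.compile(r"^\s*\d+ of \d+ DOCUMENTS\s*$")


def split_documents(source, out_dir):
    """Write each "N of M DOCUMENTS" section of source to 1.txt, 2.txt, ...

    Hypothetical helper: returns the list of files it wrote, in order.
    """
    parts = []
    out = None
    with Path(source).open(encoding="utf-8") as src:
        for line in src:
            if HEADER.match(line):
                if out is not None:
                    out.close()  # finish the previous part
                path = Path(out_dir) / f"{len(parts) + 1}.txt"
                out = path.open("w", encoding="utf-8")
                parts.append(path)
            if out is not None:  # lines before the first header are skipped
                out.write(line)
    if out is not None:
        out.close()
    return parts
```

For example, `split_documents("file", ".")` would write the numbered parts into the current directory. Only one output file is open at a time, so the number of stories does not run into open-file limits.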

answered Feb 10 '11 at 00:34 by bw_üezi

score -1 · Answer 5 · answered Feb 10 '11 at 00:36

Untested:

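base=outputfile
start=0
pattern='^[[:blank:]]*[[:digit:]]+ of [[:digit:]]+ DOCUMENTS[[:blank:]]*$'

# Reads standard input; IFS= and -r keep each line's whitespace and backslashes intact.
while IFS= read -r line
do
    if [[ $line =~ $pattern ]]
    then
        ((++start))
        printf -v filecount '%04d' "$start"
        : > "$base$filecount"    # create an empty file named like outputfile0001
    fi
    # Lines before the first header go to "$base", since filecount is still empty.
    printf '%s\n' "$line" >> "$base$filecount"
done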

answered Feb 10 '11 at 00:36 by Dennis Williamson

By the way, the above is pure Bash. Also, I'm sure that Python or Perl would be much faster. – Dennis Williamson Feb 10 '11 at 00:47
Can you do it with csplit? csplit -k -z --digits=3 --suffix='%d.TXT' --prefix=FILE *.TXT /'SPLITONTHIS' – Feb 10 '11 at 00:48
@rosser - this is a candidate for split, don't know csplit though – Feb 10 '11 at 01:09
@sln: `split` does fixed size output files rather than regexes. @rosser: `csplit` is a definite possibility. – Dennis Williamson Feb 10 '11 at 01:21
