
I'm having a bit of trouble splitting a large text file into multiple smaller ones. The syntax of my text file is the following:

dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

asdasd #299 yadayada 60 40
content
content
contend done
...and so on

A typical information table in my file has anywhere between 10 and 40 rows.

I would like this file to be split into n smaller files, where n is the number of content tables.

That is

dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

would be its own separate file (whateverN.txt),

and

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

again a separate file, whateverN+1.txt, and so forth.

It seems like awk or Perl would be nifty tools for this, but having never used them before, I find the syntax rather baffling.

I found these two questions that almost correspond to my problem, but I failed to modify their syntax to fit my needs:

Split text file into multiple files & How can I split a text file into multiple text files? (on Unix & Linux)

How should one modify the command-line invocations so that they solve my problem?

Benjamin W.
tropical e
  • I bet you need to learn how to use them (awk, perl, or whatever) a little, before you try to use them to solve your problems. – Lee Duhem Oct 23 '15 at 04:54
  • Or is there a language you do know that you can attempt a solution in? – mwp Oct 23 '15 at 05:02
  • It would be best if you edit your question to include some examples of your input and desired output, using code blocks like in the questions you linked. – Nick P Oct 23 '15 at 05:12
  • Choose a language and first try it yourself. If you still have a problem, then come back here with your attempt. – serenesat Oct 23 '15 at 05:26

9 Answers


Setting RS to null tells awk to use one or more blank lines as the record separator. Then you can simply use NR to set the name of the file corresponding to each new record:

 awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt

RS: This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines, or a regexp, in which case records are separated by matches of the regexp in the input text.

$ cat file.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

asdasd #299 yadayada 60 40
content
content
contend done

$ awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt

$ ls whatever-*.txt
whatever-1.txt  whatever-2.txt  whatever-3.txt

$ cat whatever-1.txt 
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

$ cat whatever-2.txt 
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

$ cat whatever-3.txt 
asdasd #299 yadayada 60 40
content
content
contend done
$ 
jas
  • How do we save it in a variable array? – Chand Jul 17 '17 at 18:16
  • Simple solution, nice! If you want to pass the output filename pattern as a variable, you may choose the following: `awk -v RS= -v PATTERN="whatever-%d.txt" '{FILE=sprintf(PATTERN, NR); print > FILE}' $filename` – Erwin411 Dec 11 '18 at 10:03
  • For huge files, the input record may not fit into memory (>20 GB in my case). So a line-oriented solution is preferred, see @sat's answer. My final solution is: `awk -v PATTERN="whatever-%d.txt" 'BEGIN {n=1; FILE=sprintf(PATTERN, n)} !NF {n++; FILE=sprintf(PATTERN, n); next} {print > FILE}' $filename` – Erwin411 Dec 11 '18 at 15:36
  • Be advised, you might have too many file-handles opened this way. Only gnu awk will automatically address this issue. A better version would be: `awk -v RS= '{f="whatever-" NR ".txt"; print > f; close(f)}' file` – kvantour Dec 02 '19 at 11:22
  • Nice. I was looking for something easy to set up, as my input file isn't too big, this fit the bill just right, short yet clear for me to adjust as needed – Mr Redstoner Dec 27 '19 at 19:52
  • awk is a very powerful programming language for dealing with text, and this is a good example. For more on awk I recommend Sec. 4.4 of *The Unix Programming Environment* by Kernighan and Pike and *More Programming Pearls* by Bentley. – George Co Mar 25 '21 at 23:31

You could use the csplit command:

csplit \
    --quiet \
    --prefix=whatever \
    --suffix-format=%02d.txt \
    --suppress-matched \
    infile.txt /^$/ {*}

POSIX csplit only uses short options and doesn't know --suffix and --suppress-matched, so this requires GNU csplit.

This is what the options do:

  • --quiet – suppress output of file sizes
  • --prefix=whatever – use whatever instead of the default xx filename prefix
  • --suffix-format=%02d.txt – append .txt to the default two digit suffix
  • --suppress-matched – don't include the lines matching the pattern on which the input is split
  • /^$/ {*} – split on pattern "empty line" (/^$/) as often as possible ({*})
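
For reference, this is roughly what a run on the question's sample data should produce with GNU csplit; numbering starts at 00, so the output names below follow from the prefix and suffix options above (a sketch, not captured output):

$ csplit --quiet --prefix=whatever --suffix-format=%02d.txt \
      --suppress-matched infile.txt /^$/ {*}
$ ls whatever*.txt
whatever00.txt  whatever01.txt  whatever02.txt
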
Benjamin W.

Perl has a useful feature called the input record separator, $/.

This is the 'marker' for separating records when reading a file.

So:

#!/usr/bin/env perl
use strict;
use warnings;

local $/ = "\n\n"; 
my $count = 0; 

while ( my $chunk = <> ) {
    open ( my $output, '>', "filename_".$count++ ) or die $!;
    print {$output} $chunk;
    close ( $output ); 
}

Just like that. The <> is the 'magic' filehandle: it reads piped data or the files specified on the command line (it opens and reads them). This is similar to how sed or grep work.
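
Assuming the script above is saved as, say, split_records.pl (a made-up name) and the sample data is in file.txt, a run would look something like this:

$ perl split_records.pl file.txt
$ ls filename_*
filename_0  filename_1  filename_2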

This can be reduced to a one liner:

perl -00 -pe 'open ( $out, ">", "filename_".++$n ); select $out;'  yourfilename_here
Sobrique
  • -00? Well that's something new. But I do try to avoid one liners :) – Nick P Oct 23 '15 at 11:43
  • I do generally, but when we're in an `awk` race, I try and include them for comparison. (But as much as possible _after_ some code that illustrates more clearly). – Sobrique Oct 23 '15 at 11:47
  • Thank you! This was it! However, at first, running this command resulted in the same scenario I had with other scripts. The reason apparently was that my input data files (each of them 4-8M lines long) had incorrect line separators or something similarly off. Whenever I opened them in a text editor, they would look fine, but running this command produced a single file, identical to the input file. After I copy-pasted (ugh) every data set into a blank page in a text editor and hit save, their file size would change a little (like 1M on a 150MB file), and after that this command ran just fine. – tropical e Oct 26 '15 at 21:45

You can use this awk:

awk 'BEGIN{file="content"++i".txt"} !NF{file="content"++i".txt";next} {print > file}' yourfile

(OR)

awk 'BEGIN{i++} !NF{++i;next} {print > "filename"i".txt"}' yourfile

More readable format:

BEGIN {
        file="content"++i".txt"
}
!NF {
        file="content"++i".txt";
        next
}
{
        print > file
}
sat
  • Instead of `$0 ~ /^$/` you could just use `/^$/` or more commonly `!NF`. You want `print > file`, not `print >> file` - shell and awk have different semantics for `>` vs `>>`. – Ed Morton Oct 23 '15 at 10:25
  • @EdMorton, You're correct. Updated. Thank you for the hint ( `shell` and `awk` have different semantics for `>` vs `>>` ). – sat Oct 23 '15 at 11:12
  • Use `print > ("filename"i".txt")` instead of `print > "filename"i".txt"` as the meaning of that statement is undefined in POSIX and some awks will treat it as `(print > "filename") i".txt"` or something else undesirable. – Ed Morton Oct 23 '15 at 14:53
  • also add a line to close the file – user1778602 Jan 26 '19 at 16:53
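
Putting the comments above together, i.e. parenthesizing the filename expression and closing each finished file when a blank line is reached, a variant might look like this (my own sketch, not part of the original answer):

awk 'BEGIN { i = 1 }
     !NF   { close("filename" i ".txt"); ++i; next }
           { print > ("filename" i ".txt") }' yourfile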

In case you get a "too many open files" error like the following...

awk: whatever-18.txt makes too many open files
 input record number 18, file file.txt
 source line number 1

You may need to close each newly created file before creating the next one, as follows.

awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt

Since it's Friday and I'm feeling a bit helpful... :)

Try this. If the file is as small as you imply, it's simplest to just read it all at once and work in memory.

use strict;
use warnings;

# slurp file
local $/ = undef;
open my $fh, '<', 'test.txt' or die $!;
my $text = <$fh>;
close $fh;

# split on double new line
my @chunks = split(/\n\n/, $text);

# make new files from chunks
my $count = 1;
for my $chunk (@chunks) {
    open my $ofh, '>', "whatever$count.txt" or die $!;
    print $ofh $chunk, "\n";
    close $ofh;
    $count++;
}
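
With the sample data saved as test.txt (the name the script expects) and the script saved as, say, split_chunks.pl (a made-up name), running it should leave one file per chunk:

$ perl split_chunks.pl
$ ls whatever*.txt
whatever1.txt  whatever2.txt  whatever3.txt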

The perl docs can explain any individual commands you don't understand but at this point you should probably look into a tutorial as well.

Nick P
awk -v RS="\n\n" '{for (i=1;i<=NR;i++); print > i-1}' file.txt

Sets the record separator to a blank line and prints each record to a separate file numbered 1, 2, 3, etc. Only the last file ends in a blank line.

user2138595
  • Using multiple chars for RS makes this gawk specific but you should be using `RS=""` anyway. Also always parenthesize the right side of output redirection as some awks will interpret `print > i-1` as `(print > i) -1`. Most importantly though - the logic is wrong and it'll print NR occurrences of each record. – Ed Morton Oct 23 '15 at 10:22

You can also try this bash script:

#!/bin/bash
i=1
fileName="OutputFile_$i"
while read line ; do
    if [ "$line" == "" ] ; then
        ((++i))
        fileName="OutputFile_$i"
    else
        echo $line >> "$fileName"
    fi
done < InputFile.txt
Kalanidhi
  • That will corrupt the contents of his input file and produce different output based on the contents of the input file plus the contents of whatever directory you are running it from. Do NOT write shell loops just to manipulate text. See http://unix.stackexchange.com/q/169716/133219 – Ed Morton Oct 23 '15 at 10:20
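
If you do want to experiment with a plain bash loop despite that caveat, a more careful sketch (mine, not from the answer) would use read -r, quote expansions, and write with printf:

#!/bin/bash
# Split InputFile.txt into OutputFile_1, OutputFile_2, ... on blank lines.
i=1
fileName="OutputFile_$i"
while IFS= read -r line ; do
    if [ -z "$line" ] ; then
        ((++i))
        fileName="OutputFile_$i"
    else
        # printf with a quoted expansion avoids the word splitting and
        # globbing that an unquoted `echo $line` is subject to
        printf '%s\n' "$line" >> "$fileName"
    fi
done < InputFile.txt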

You can also try split -p "^$" (note that the -p pattern option is available in BSD/macOS split, but not in GNU split).
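
On a BSD system that would look roughly like this; note that, unlike csplit --suppress-matched, the matched blank line itself starts each subsequent output file (a sketch, not tested output):

$ split -p '^$' file.txt whatever
$ ls whatever*
whateveraa  whateverab  whateverac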

Nuno Silva