Split a file using a pattern as a delimiter

Question

I have a 5000-line file consisting of blocks of lines, with an END string between blocks, as follows:

ATOM 1
ATOM 3
ATOM 25
END 
ATOM 2
ATOM 36
ATOM 22
ATOM 12 
END 
ATOM 1
ATOM 87
END

I want to find a way to split the file into several files, each containing a single block of lines before the END string. The first file should look as follows:

ATOM 1
ATOM 3
ATOM 25

The second file should contain

ATOM 2
ATOM 36
ATOM 22
ATOM 12

And so on. I have thought of using something like awk '/END/{flag=1; next} /END/{flag=0} flag' file to take the blocks between the END strings. This, however, does not work for my first block, as the END string is only after the block, and most importantly, cannot take into account the number of times it has found the string END to separate each block into its individual file. Is there a way I can use the string END to split my file into several, each containing a block that ends with the string END?

Are the trailing blanks intentional? – Cyrus Nov 26 '22 at 22:43

score 2 · Answer 1 · answered Nov 26 '22 at 22:22

Close. Increment the flag each block. And output to a file. In awk:

awk 'BEGIN{flag=0} /END/{flag++} {print $0 > flag ".txt"}' file

In Bash:

flag=0
while IFS= read -r line; do
   if [[ "$line" = "END" ]]; then
      flag=$((flag + 1))
   else
      printf "%s\n" "$line" >> "$flag.txt"
   fi
done <inputfile

And so on, in any other programming language.

KamilCuk

Thank you! The one-line worked perfectly. The one thing is that it prints the END as the first line for all files. Is there a way to avoid this? – user19619903 Nov 26 '22 at 22:41
Use `/END/{flag++;next}` so that the print is not executed on an END line. You could also use `!/END/{print ...}`. – KamilCuk Nov 26 '22 at 23:16

`print $0 > flag ".txt"` will produce a syntax error in some awks, it should be `print $0 > (flag ".txt")` to behave the same way in all awks. Not `close()`ing the output files as you go will lead to a "too many open files" error in some awks once you get past a threshold. The shell script would, of course, be several orders of magnitude slower than the awk script and would convert some escape sequences to literal chars (e.g. `\t` to a tab) - see [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/q/169716/133219) – Ed Morton Nov 27 '22 at 00:34
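Putting the comments above together, a minimal sketch of the one-liner might look like the following. The scratch directory, the `input.txt` name, and the shortened sample data are assumptions made only for this illustration; the awk part just applies the parenthesized redirection, the `close()` call, and the `next` that were suggested.

```shell
# Work in a scratch directory (an assumption for this demo) so nothing
# is written into the current directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Shortened sample input modeled on the question; "input.txt" is an arbitrary name.
printf '%s\n' 'ATOM 1' 'ATOM 3' 'ATOM 25' 'END' \
              'ATOM 2' 'ATOM 36' 'END' > input.txt

awk '
    BEGIN { flag = 0 }
    # On an END line: close the file just written, move to the next
    # number, and skip the line so END itself is never printed.
    /END/ { close(flag ".txt"); flag++; next }
    # Parentheses around the concatenation keep the redirection
    # unambiguous in every awk.
          { print > (flag ".txt") }
' input.txt
```

With this input the sketch should leave `0.txt` and `1.txt` in the scratch directory, neither containing an END line. No third file is created, because nothing follows the last END.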

score 2 · Answer 2 · answered Nov 26 '22 at 22:51

awk's record separator (RS) can be reset to read blocks separated by the word "END", and each block can be printed to a file with a numerically incremented filename as follows:

awk 'BEGIN{RS="END";ORS="";i=1;} {print > "part"i".file"; i++}' file.txt

The output record separator ORS has been set to an empty string to prevent additional new lines at the end of the file. Files after the first part still have a leading empty line that could be removed if essential. It also creates an additional empty file that can be ignored for this 'quick and dirty' solution.

An incremented counter i is used to form sequential file names.

Output from the above command, run on a copy of your input:

> ls part*
part1.file  part2.file  part3.file  part4.file
> cat part1.file
ATOM 1
ATOM 3
ATOM 25
> cat part2.file
 
ATOM 2
ATOM 36
ATOM 22
ATOM 12

(part4.file is empty)

Possible problem: some versions of awk apparently don't accept string concatenation as the target of a print redirection. If an error occurs here, the filename can be built beforehand, as in this slightly longer version:

awk 'BEGIN{RS="END";ORS="";i=1;} {flname="part"i".file"; print > flname; i++}' file.txt

only some awk versions will accept a multi-char RS, e.g. GNU awk, and those would, I expect, be OK with no parens around the expression on the right side of redirection. With all other awks `RS="END"` will be treated like `RS="E"`, `print > "part"i".file"` will be a syntax error, and not `close()`ing the output files as you go will lead to a "too many open files" error once you get past a threshold. — Ed Morton, Nov 27 '22 at 00:08

score 2 · Answer 3 · answered Nov 27 '22 at 00:06

Using any awk:

$ awk -v cnt=1 '
    /END/ { cnt++; next }
    cnt != prev { close(out); out="foo" cnt ".txt"; prev=cnt }
    { print > out }
' file

$ head foo*.txt
==> foo1.txt <==
ATOM 1
ATOM 3
ATOM 25

==> foo2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12

==> foo3.txt <==
ATOM 1
ATOM 87

Ed Morton

It may be better to use `$1=="END" { cnt++; next }`, otherwise `END` anywhere in a line will trigger the next file... – dawg Nov 27 '22 at 00:22

@dawg agreed that'd be more robust, I just didn't do that since it wouldn't matter given the OP's sample input and I didn't want it to look like my solution required it when the others didn't for the same input. – Ed Morton Nov 27 '22 at 00:30
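To illustrate the point raised in these comments, here is a small sketch of the same script with the `$1 == "END"` test. The input is hypothetical: the line `ATOM BEND 7` is not from the question and is there only to show a line that contains END without being a delimiter. The scratch directory and the `input.txt` name are likewise assumptions.

```shell
# Scratch directory and file name are assumptions for this demo.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical input: "ATOM BEND 7" contains END inside another word,
# and the first delimiter has a trailing blank as in the question.
printf '%s\n' 'ATOM 1' 'ATOM BEND 7' 'END ' 'ATOM 2' 'END' > input.txt

awk -v cnt=1 '
    # Only a line whose first field is exactly END ends a block.
    # Field splitting ignores the trailing blank on "END ".
    $1 == "END" { close(out); cnt++; next }
    { out = "foo" cnt ".txt"; print > out }
' input.txt
```

Here `foo1.txt` should keep the `ATOM BEND 7` line, whereas the `/END/` test would have treated that line as a delimiter and started a new file there.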

karakfa · Answer 4 · 2022-11-27T00:45:09.780

1

$ awk '/END/{c++; next} {print > ("file."(c+1)".txt")}' file



==> file.1.txt <==
ATOM 1
ATOM 3
ATOM 25

==> file.2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12

==> file.3.txt <==
ATOM 1
ATOM 87

If you have many sections, you may eventually run into a "too many open files" error, so it is better to close each file when you are done with it.

$ awk 'BEGIN {f="file."(++c)".txt"}
       /END/ {close(f); f="file."(++c)".txt"; next}
             {print > f}' file

edited Nov 27 '22 at 00:45

answered Nov 27 '22 at 00:00

karakfa

Yes, right! Fixed. – karakfa Nov 27 '22 at 00:45

potong · Answer 5 · 2022-11-27T08:07:59.460

1

This might work for you (GNU csplit):

csplit -qz -f file -b '%04d.txt' --suppress-matched file '/END/' '{*}'

Be quiet and elide any empty files.

Prefix the output files with file and suffix with four digits plus .txt.

Suppress the matching lines e.g. END.

Repeat until the end of the file.

If you do not mind the output files getting csplit's default names (xx00, xx01, and so on), use:

csplit -qz --sup file '/END/' '{*}'

edited Nov 27 '22 at 08:07

answered Nov 27 '22 at 08:01

potong

dawg · Answer 6 · 2022-11-27T20:31:47.913

A few other ways.

Perl:

perl -0777 -lnE 'while (/([\s\S]*?)^END\s*/gm) {
    $cnt++;
    open(FH, ">file_${cnt}.txt");
    print FH $1;
    close (FH);
}' file

Ruby:

ruby -e 'cnt=1; s=$<.read.scan(/([\s\S]*?)^END\s*/m) { |b|
    File.write("file_#{cnt}.txt", b.join(""))
    cnt+=1
}' file

Any awk:

awk 'BEGIN { i=1; fn=sprintf("file_%s.txt", i) }
    $1=="END" { close(fn); fn=sprintf("file_%s.txt", ++i); next }
    {print > fn }
' file

Or, you can use sed and process substitution with Bash. (Note: this only works if the file is properly terminated with a final newline.)

while IFS= read -r -d $'\3' block; do
    (( i++ ))
    printf "%s" "$block" > "file_${i}.txt"
done < <(sed '/^END[[:space:]]*$/N; s/^END[[:space:]]*/\x3/' file)

Any of these results in:

head file_*.txt
==> file_1.txt <==
ATOM 1
ATOM 3
ATOM 25

==> file_2.txt <==
ATOM 2
ATOM 36
ATOM 22
ATOM 12 

==> file_3.txt <==
ATOM 1
ATOM 87

# ^ Note final file has proper \n termination
