105

I have one file with `-|` as a delimiter after each section. I need to create separate files for each section using Unix.

example of input file

wertretr
ewretrtret
1212132323
000232
-|
ereteertetet
232434234
erewesdfsfsfs
0234342343
-|
jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

Expected result in File 1

wertretr
ewretrtret
1212132323
000232
-|

Expected result in File 2

ereteertetet
232434234
erewesdfsfsfs
0234342343
-|

Expected result in File 3

jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|
codeforester
user1499178

12 Answers

114

A one-liner, no programming (except the regexp, etc.):

csplit --digits=2  --quiet --prefix=outfile infile "/-|/+1" "{*}"

tested on: csplit (GNU coreutils) 8.30
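For the sample input in the question, the command can be exercised like this (assumes GNU csplit from coreutils; see the Mac notes that follow for BSD/macOS caveats):

```shell
# Recreate the sample input from the question, then split it.
printf '%s\n' wertretr ewretrtret 1212132323 000232 '-|' \
              ereteertetet 232434234 erewesdfsfsfs 0234342343 '-|' \
              jdhg3875jdfsgfd sjdhfdbfjds 347674657435 '-|' > infile

csplit --digits=2 --quiet --prefix=outfile infile "/-|/+1" "{*}"

ls outfile*   # outfile00 outfile01 outfile02, plus an empty trailing
              # file unless --elide-empty-files is added (see comments)
```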

Notes about usage on Apple Mac

"For OS X users, note that the version of csplit that comes with the OS doesn't work. You'll want the version in coreutils (installable via Homebrew), which is called gcsplit." — @Danial

"Just to add, you can get the version for OS X to work (at least with High Sierra). You just need to tweak the args a bit csplit -k -f=outfile infile "/-\|/+1" "{3}". Features that don't seem to work are the "{*}", I had to be specific on the number of separators, and needed to add -k to avoid it deleting all outfiles if it can't find a final separator. Also if you want --digits, you need to use -n instead." — @Pebbl

ctrl-alt-delor
  • 1
    +1 - shorter: `csplit -n2 -s -b outfile infile "/-|/+1" "{*}"` – zb226 May 28 '14 at 12:04
  • 32
    @zb226 I did it in long, so that no explanation was needed. – ctrl-alt-delor Jun 07 '14 at 10:45
  • 5
    I suggest to add `--elide-empty-files`, otherwise there will be a empty file at the end. – luator Nov 20 '14 at 15:25
  • 13
    Just for those who wonder what the parameters mean: `--digits=2` controls the number of digits used to number the output files (2 is default for me, so not necessary). `--quiet` suppresses output (also not really necessary or asked for here). `--prefix` specifies the prefix of the output files (default is xx). So you can skip all the parameters and will get output files like `xx12`. – Christopher K. Jul 26 '17 at 16:22
  • 1
    I have updated the question to include the un-read comments about apple mac. – ctrl-alt-delor Jul 15 '20 at 09:57
  • wooop. it works on Mojave you are a scholar and a gentleman.. one question, how do u add a suffix like .html to the outfile? – AGrush Jul 15 '20 at 10:24
  • @AGrush I had to look this up when I answered (I remembered it from my training, but still had to look it up). The first thing to note is that Unix does not have suffixes like `.html` (that is, there is nothing special about the `.`). Try these options: `--quiet --prefix="" --suffix-format="prefix%02d.html"`. Note I got rid of `--digits` and set `--prefix` to `""` (empty). – ctrl-alt-delor Jul 15 '20 at 16:11
  • where do you look this up.. seems my version isn't recognising these options.. i cant find any clear guide on this stuff – AGrush Jul 15 '20 at 17:15
  • `man csplit`. You may need the Gnu version. Gnu likes to improve the tools. – ctrl-alt-delor Jul 16 '20 at 21:44
45
awk '{f="file" NR; print $0 " -|"> f}' RS='-\\|'  input-file

Explanation (edited):

RS is the record separator, and this solution uses a GNU awk extension which allows it to be more than one character. NR is the record number.

The print statement prints a record followed by " -|" into a file that contains the record number in its name.
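Folding in the fixes discussed in the comments below (a separate filename variable, a `.txt` extension, and `close()` so long inputs do not exhaust file descriptors) gives a sketch like the following; the `NF` guard that skips the empty trailing record is my addition, not part of the original answer:

```shell
# Sample input, then the variant (multi-character RS is a GNU awk extension).
printf '%s\n' wertretr ewretrtret 1212132323 000232 '-|' \
              ereteertetet 232434234 erewesdfsfsfs 0234342343 '-|' > input-file

awk 'BEGIN{RS="-\\|"} NF{f = "file" NR ".txt"; print $0 " -|" > f; close(f)}' input-file
```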

William Pursell
  • How well does this work on really big files (> 3 GB)? I'm not familiar with awk. – rzetterberg Jun 30 '13 at 10:10
  • Could you please explain the different parts? What is `RS`? What is `NR`? – Martin Thoma Jul 14 '14 at 16:45
  • 1
    `RS` is the record separator, and this solution uses a gnu awk extension which allows it to be more than one character. NR is the record number. The print statement prints a record followed by " -|" into a file that contains the record number in its name. – William Pursell Jul 14 '14 at 22:44
  • 1
    @rzetterbeg This should work well with large files. awk processes the file one record at a time, so it only reads as much as it needs to. If the first occurrence of the record separator shows up very late in the file, it may be a memory crunch since one whole record must fit into memory. Also, note that using more than one character in RS is not standard awk, but this will work in gnu awk. – William Pursell Dec 22 '14 at 15:59
  • 4
    For me it split 3.3 GB in 31.728s – Cleankod May 25 '15 at 13:18
  • How to customize the file extension (e.g. `file1.txt`, `file2.txt`, etc)? – Quinn Apr 07 '16 at 15:30
  • 4
    @ccf The filename is just the string on the right side of the `>`, so you can construct it however you like. eg, `print $0 "-|" > "file" NR ".txt"` – William Pursell Apr 07 '16 at 17:37
  • throws error: awk: syntax error at source line 1 context is {print $0 " -|"> "file" >>> NR <<< } awk: illegal statement at source line 1 – AGrush Jul 15 '20 at 09:26
  • 1
    @AGrush That is version dependent. You can do `awk '{f="file" NR; print $0 " -|" > f}'` – William Pursell Jul 15 '20 at 12:25
  • thank you, also how can i add an extension to the file name like '.html'? – AGrush Jul 15 '20 at 12:38
  • 1
    @AGrush `f="file" NR ".html"...` It's not the cleanest syntax, but in awk you concatenate strings by placing them next to each other with no operator. Alternatively, you can use `sprintf` – William Pursell Jul 15 '20 at 12:40
  • $ awk '{f="file" NR; ".html" print $0 " -|" > f}' x.txt awk: syntax error at source line 1 context is {f="file" NR; ".html" >>> print – AGrush Jul 15 '20 at 15:24
  • also i get this error without adding that ".html": awk: file18 makes too many open files input record number 18, file x.txt – AGrush Jul 15 '20 at 15:26
7

Debian has csplit, but I don't know if that's common to all/most/other distributions. If not, though, it shouldn't be too hard to track down the source and compile it...

twalberg
  • 1
    I agree. My Debian box says that csplit is part of gnu coreutils. So any Gnu operating system, such as all the Gnu/Linux distros will have it. Wikipedia also mentions 'The Single UNIX® Specification, Issue 7' on the csplit page, so I suspect you got it. – ctrl-alt-delor Jul 03 '12 at 15:52
  • 3
    Since [`csplit`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/csplit.html) is in POSIX, I would expect it to be available on essentially all Unix-like systems. – Jonathan Leffler Jul 03 '12 at 15:54
  • 1
    Although csplit is POSIX, the problem (it seems, doing a test with it on the Ubuntu system sitting in front of me) is that there is no obvious way to make it use a more modern regex syntax. Compare: `csplit --prefix gold-data - "/^==*$/` vs `csplit --prefix gold-data - "/^=+$/`. At least GNU grep has `-e`. – new123456 Sep 14 '13 at 17:09
5

I solved a slightly different problem, where the file contains a line with the name of the file that the following text should go into. This Perl code does the trick for me:

#!/path/to/perl -w

#comment the line below for UNIX systems
use Win32::Clipboard;

# Get command line flags

#print ($#ARGV, "\n");
if($#ARGV == 0) {
    print STDERR "usage: ncsplit.pl --mff -- filename.txt [...] \n\nNote that no space is allowed between the '--' and the related parameter.\n\nThe mff is found on a line followed by a filename.  All of the contents of filename.txt are written to that file until another mff is found.\n";
    exit;
}

# this package sets the ARGV count variable to -1;

use Getopt::Long;
my $mff = "";
GetOptions('mff' => \$mff);

# set a default $mff variable
if ($mff eq "") {$mff = "-#-"};
print ("using file switch=", $mff, "\n\n");

while($_ = shift @ARGV) {
    if(-f "$_") {
    push @filelist, $_;
    } 
}

# Could be more than one file name on the command line, 
# but this version throws away the subsequent ones.

$readfile = $filelist[0];

open SOURCEFILE, "<$readfile" or die "File not found...\n\n";
#print SOURCEFILE;

while (<SOURCEFILE>) {
    /^$mff (.*$)/o;
    $outname = $1;
#   print $outname;
#   print "right is: $1 \n";

    if (/^$mff /) {
        open OUTFILE, ">$outname";
        print "opened $outname\n";
    }
    else {
        print OUTFILE "$_";
    }
}
jia103
John David Smith
  • Can you please explain why this code works? I have a similar situation to what you've described here - the required output file names are embedded inside the file. But I'm not a regular perl user so can't quite make sense of this code. – shiri Mar 13 '17 at 15:33
  • The real beef is in the final `while` loop. If it finds the `mff` regex at beginning of line, it uses the rest of the line as the filename to open and start writing to. It never closes anything so it will run out of file handles after a few dozen. – tripleee Nov 07 '18 at 10:04
  • The script would actually be improved by removing most of the code before the final `while` loop and switching to `while (<>)` – tripleee Nov 07 '18 at 10:05
4

The following command works for me. Hope it helps.

awk 'BEGIN{file = 0; filename = "output_" file ".txt"}
    /-|/ {getline; file ++; filename = "output_" file ".txt"}
    {print $0 > filename}' input
tripleee
Thanh
  • 1
    This will run out of file handles after typically a few dozen files. The fix is to explicitly `close` the old file when you start a new one. – tripleee Oct 23 '18 at 15:20
  • @tripleee how do you close it (beginner awk question). Can you provide an updated example? – Jesper Rønn-Jensen Nov 07 '18 at 09:51
  • 1
    @JesperRønn-Jensen This box is probably too small for any useful example but basically `if (file) close(filename);` before assigning a new `filename` value. – tripleee Nov 07 '18 at 09:53
  • aah found out how to close it: `; close(filename)`. Really simple, but it really fixes the example above – Jesper Rønn-Jensen Nov 07 '18 at 09:53
  • Thanks @tripleee for the quick and helpful explanation :) – Jesper Rønn-Jensen Nov 07 '18 at 09:57
  • 1
    @JesperRønn-Jensen I rolled back your edit because you provided a broken script. Significant edits to other people's answers should probably be avoided -- feel free to post a new answer of your own (perhaps as a [community wiki](https://meta.stackexchange.com/questions/11740/what-are-community-wiki-posts)) if you think a separate answer is merited. – tripleee Nov 07 '18 at 09:59
  • throws error: awk: illegal primary in regular expression -| at source line number 2 context is >>> /-|/ << – AGrush Jul 15 '20 at 09:27
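Escaping the `|` (the unescaped `/-|/` is what produces the "illegal primary" error in the last comment) and adding the `close()` suggested above gives a sketch like this; note that, unlike the question's expected output, it drops the `-|` lines instead of copying them:

```shell
printf '%s\n' wertretr ewretrtret '-|' ereteertetet 232434234 '-|' > input

awk 'BEGIN{file = 0; filename = "output_" file ".txt"}
     /-\|/ {close(filename); file++; filename = "output_" file ".txt"; next}
     {print $0 > filename}' input
```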
3

You can also use awk. I'm not very familiar with awk, but the following did seem to work for me. It generated part1.txt, part2.txt, part3.txt, and part4.txt. Do note that the last partn.txt file this generates is empty. I'm not sure how to fix that, but I'm sure it could be done with a little tweaking. Any suggestions, anyone?

awk_pattern file:

BEGIN{ fn = "part1.txt"; n = 1 }
{
   print > fn
   if (substr($0,1,2) == "-|") {
       close (fn)
       n++
       fn = "part" n ".txt"
   }
}

bash command:

awk -f awk_pattern input.file
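On the empty-last-file question: one tweak (mine, not part of the original answer) is to open each part lazily, so a file is only created once it actually has a line to hold:

```shell
printf '%s\n' aaa bbb '-|' ccc ddd '-|' > input.file

# fn is empty between sections; a part file is opened only on the
# first line of a section, so no empty trailing file is created.
awk '{ if (fn == "") fn = "part" (++n) ".txt"
       print > fn
       if (substr($0, 1, 2) == "-|") { close(fn); fn = "" } }' input.file
```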

rkyser
2

Here's a Python 3 script that splits a file into multiple files based on a filename provided by the delimiters. Example input file:

# Ignored

######## FILTER BEGIN foo.conf
This goes in foo.conf.
######## FILTER END

# Ignored

######## FILTER BEGIN bar.conf
This goes in bar.conf.
######## FILTER END

Here's the script:

#!/usr/bin/env python3

import os
import argparse

# global settings
start_delimiter = '######## FILTER BEGIN'
end_delimiter = '######## FILTER END'

# parse command line arguments
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input filename")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")

args = parser.parse_args()

# read the input file
with open(args.input_file, 'r') as input_file:
    input_data = input_file.read()

# iterate through the input data by line
input_lines = input_data.splitlines()
while input_lines:
    # discard lines until the next start delimiter
    while input_lines and not input_lines[0].startswith(start_delimiter):
        input_lines.pop(0)

    # corner case: no delimiter found and no more lines left
    if not input_lines:
        break

    # extract the output filename from the start delimiter
    output_filename = input_lines.pop(0).replace(start_delimiter, "").strip()
    output_path = os.path.join(args.output_dir, output_filename)

    # open the output file
    print("extracting file: {0}".format(output_path))
    with open(output_path, 'w') as output_file:
        # while we have lines left and they don't match the end delimiter
        while input_lines and not input_lines[0].startswith(end_delimiter):
            output_file.write("{0}\n".format(input_lines.pop(0)))

        # remove end delimiter if present
        if input_lines:
            input_lines.pop(0)

Finally here's how you run it:

$ python3 script.py -i input-file.txt -o ./output-folder/
ctrlc-root
2

Use csplit if you have it.

If you don't, but you have Python... don't use Perl.

Lazy reading of the file

Your file may be too large to hold in memory all at once - reading line by line may be preferable. Assume the input file is named "samplein":

$ python3 -c "from itertools import count
with open('samplein') as file:
    for i in count():
        firstline = next(file, None)
        if firstline is None:
            break
        with open(f'out{i}', 'w') as out:
            out.write(firstline)
            for line in file:
                out.write(line)
                if line == '-|\n':
                    break"
tripleee
Russia Must Remove Putin
1
cat file| ( I=0; echo -n "">file0; while read line; do echo $line >> file$I; if [ "$line" == '-|' ]; then I=$[I+1]; echo -n "" > file$I; fi; done )

And the formatted version:

#!/bin/bash
cat FILE | (
  I=0;
  echo -n "" > file0;
  while read line; 
  do
    echo $line >> file$I;
    if [ "$line" == '-|' ];
    then I=$[I+1];
      echo -n "" > file$I;
    fi;
  done;
)
mbonnin
  • 4
    As ever, [the `cat` is Useless](http://www.iki.fi/era/unix/award.html). – tripleee Aug 16 '16 at 04:47
  • 1
    @Reishin The linked page explains in much more detail how you can avoid `cat` on a single file in every situation. There is a Stack Overflow question with more discussion (though the accepted answer is IMHO off); https://stackoverflow.com/questions/11710552/useless-use-of-cat – tripleee Oct 23 '18 at 09:16
  • 1
    The shell is typically very inefficient at this sort of thing anyway; if you can't use `csplit`, an Awk solution is probably much preferrable to this solution (even if you were to fix the problems reported by http://shellcheck.net/ etc; note that it doesn't currently find all the bugs in this). – tripleee Oct 23 '18 at 09:20
  • @tripleee but if the task is to do it without awk, csplit and etc - only bash? – Reishin Oct 23 '18 at 14:00
  • 1
    Then the `cat` is still useless, and the rest of the script could be simplified and corrected a good deal; but it will still be slow. See e.g. https://stackoverflow.com/questions/13762625/bash-while-read-line-extremely-slow-compared-to-cat-why – tripleee Oct 23 '18 at 14:24
  • just outputs one file, the same one but without the .txt – AGrush Jul 15 '20 at 09:29
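For completeness, a rewrite of the same loop with the issues flagged above addressed (no `cat`, quoted expansions, `read -r`, POSIX arithmetic); it is still slow compared to the awk and csplit approaches, and like the original it leaves an empty trailing file:

```shell
printf '%s\n' wertretr ewretrtret '-|' ereteertetet '-|' > FILE

i=0
: > "file$i"
while IFS= read -r line; do
    printf '%s\n' "$line" >> "file$i"
    if [ "$line" = '-|' ]; then
        i=$((i+1))
        : > "file$i"
    fi
done < FILE
```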
0

Here is some Perl code that will do the job:

#!/usr/bin/perl
open(FI,"file.txt") or die "Input file not found";
$cur=0;
open(FO,">res.$cur.txt") or die "Cannot open output file $cur";
while(<FI>)
{
    print FO $_;
    if(/^-\|/)
    {
        close(FO);
        $cur++;
        open(FO,">res.$cur.txt") or die "Cannot open output file $cur"
    }
}
close(FO);
jia103
amaksr
0

This is the sort of problem I wrote context-split for: http://stromberg.dnsalias.org/~strombrg/context-split.html

$ ./context-split -h
usage:
./context-split [-s separator] [-n name] [-z length]
        -s specifies what regex should separate output files
        -n specifies how output files are named (default: numeric
        -z specifies how long numbered filenames (if any) should be
        -i include line containing separator in output files
        operations are always performed on stdin
user1277476
  • Uh, this looks like essentially a duplicate of the standard `csplit` utility. See [@richard's answer](http://stackoverflow.com/a/11314918/874188). – tripleee Aug 16 '16 at 04:46
  • This is actually the best solution imo. I've had to split a 98G mysql dump and csplit for some reason eats up all my RAM, and is killed. Even though it should only need to match one line at the time. Makes no sense. This python script works much better and doesn't eat up all the ram. – Stefan Midjich Feb 20 '18 at 23:56
0

Try this Python script:

import os
import argparse

delimiter = '-|'

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input txt")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")

args = parser.parse_args()

counter = 1
output_filename = 'part-'+str(counter)
with open(args.input_file, 'r') as input_file:
    for line in input_file.read().split('\n'):
        if delimiter in line:
            counter = counter+1
            output_filename = 'part-'+str(counter)
            print('Section '+str(counter)+' Started')
        else:
            #skips empty lines (change the condition if you want empty lines too)
            if line.strip() :
                output_path = os.path.join(args.output_dir, output_filename+'.txt')
                with open(output_path, 'a') as output_file:
                    output_file.write("{0}\n".format(line))

ex:

python split.py -i ./to-split.txt -o ./output-dir

Mehdi Nazari