
I have around 6,000 smallish text files. Some have only 3 or 4 lines, and a few might have 100 or more. I thought I would merge them into one large file to make reading them easier. A Windows batch file did the merge, adding a "=======" line between each merged file, but the new file is about 50MB, with 900,000 lines. Too big. I would like to split it into about fifty files of roughly 1MB each.

The split programs I have looked at split either by exact size or by line count, but I don't want any individual text file split between two chunks. So in the example below, I don't want one chunk to end with "Brown Fox" and the next one to start with "Jumps". In other words, treat everything between the ======= separators as an unbreakable unit.

This is a Windows/DOS file, so there is no need to change the CRLF line endings. The file does not contain any special codes for printing, coloring, etc.

Merged file example:

=======  
One  
Two  
=======  
Abc  
=======  
The quick  
Brown Fox  
Jumps  
Over the dog  
=======  
Dfdfasdf  
Eeffee  
  
Eewweew  
Lk klkl Y tyyd  
=======  


I typed this command on a Windows command line to create the 50MB file All.asc:

    For %A in (D:\@temp\*.txt) Do @(CAT53 -s %A & Echo =======) >>D:\@temp\All.asc

When I ran this command (specifying 30 bytes for testing):

    split -b30 all.asc BB

the output for the second file (BBab) looked like this:

    ==
    Abc
    =======
    The qu

I didn't think checking the size of the All.asc file after each concatenation and aborting once the size exceeded 1MB would be very efficient. I thought a solution involving merging and then splitting would be simpler and could be reused.

I have the Unix utilities on my PC, but I'm not sure whether sed, awk, or split would be useful. The GSplit utility doesn't seem to do what I need.

Compo
    please review [how to format](https://stackoverflow.com/help/formatting) and then reformat your question – markp-fuso Feb 21 '23 at 21:19
    please update the question with the 'split' code you've tried so far and explain how it failed and/or did not do what you want – markp-fuso Feb 21 '23 at 21:19
    does the file contain any non-printing characters (eg, color codes, escape codes, control characters)? does the file contain any multi-byte characters? – markp-fuso Feb 21 '23 at 21:21
  • Potential responders: please don't simply post a sed or awk command line exactly as used on Unix without confirming that it is the correct format and syntax for the stated Windows OS and, if required, for the third-party Unix Utilities files. – Compo Feb 22 '23 at 00:04
  • If your main concern is to ensure the original file lines are not interrupted when you split the merged file and neither the number of lines nor byte size are critical, why not simply process the merged file in `awk`, setting the record separator to `=======`, and home in on how many records to print to part files by estimation and refinement, starting with, say, 120 records per split (120 being 1/50th of 600). – Dave Pritlove Feb 22 '23 at 01:23
  • I think you made the wrong decision in using a **batch-file** instead of **PowerShell**. A similar [task](https://stackoverflow.com/questions/1001776/) has been solved with PS. – Daemon-5 Feb 22 '23 at 05:31
    Instead of creating a large file and then splitting it up into smaller files, why not just create the smaller files in the first place by, while reading the original very small input files, starting a new output file every time the total input length reaches 1M chars or similar? – Ed Morton Feb 22 '23 at 16:50
  • I wondered about the time/overhead of creating smaller files in the first place instead of chopping up the files afterwards. I could do something like `FOR /F "usebackq" %%A IN ('%file%') DO set size=%%~zA` after copying each file, and then continue copying while the size was less than 1MB. I also thought about creating a bunch of folders and moving 1MB worth of files into each folder. – user2574126 Feb 22 '23 at 20:25
  • @Daemon-5: None of the PS solutions posted at your link manages the OP's requirement of preserving the blocks of lines delimited by `========`. Besides, I am pretty sure that a complete PS solution would be slower than an equivalent Batch-file one when processing a big file... – Aacini Feb 23 '23 at 03:39
  • @user2574126: Yes, that is the same method I used in [my solution](https://stackoverflow.com/a/75528316/778560)... – Aacini Feb 23 '23 at 03:52
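
The streaming approach suggested in the comments above — check the running size only when a separator line arrives, so a block is never split — could be sketched with the awk from the Unix utilities like this. The `part` filename prefix and the tiny 20-byte limit are purely illustrative; for the real file the limit would be around 990000:

```shell
#!/bin/sh
# Build a small sample merged file shaped like the one in the question.
printf '=======\nOne\nTwo\n=======\nAbc\n=======\nThe quick\nBrown Fox\n' > all.asc

# Stream it once. At each ======= separator, check the bytes written so
# far; once over the limit, close the current part and start the next.
# Blocks stay whole because the size check happens only at separators.
awk -v limit=20 '
  /^=======/ && size > limit { close("part" part ".txt"); part++; size = 0 }
  { print > ("part" part ".txt"); size += length($0) + 1 }
' part=1 all.asc
```

With this sample input it writes part1.txt (the first two blocks) and part2.txt (the last block).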

2 Answers


This might work for you (GNU parallel):

    cat file | parallel --pipe --recstart '=======' cat \>part{#}

Pipe the file into parallel.

The default block size is 1MB, and `--recstart '======='` ensures that the file is split only at `=======` boundaries.

The output files are named part1 to part50.


By the way, the merged file could have been created by:

    sed -s '1i\=======' *.txt >file
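
For illustration, here is how that line behaves on two tiny stand-in files (GNU sed: the `-s` option treats each input file separately, so the `1i\` insertion fires once per file rather than once for the whole stream; note that it puts the separator before each file's content):

```shell
#!/bin/sh
# Two stand-in input files (the names are illustrative).
printf 'One\nTwo\n' > a.txt
printf 'Abc\n' > b.txt

# -s resets sed state per file, so line 1 of *each* file gets a separator.
sed -s '1i\=======' a.txt b.txt > file

cat file
# =======
# One
# Two
# =======
# Abc
```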
potong

This Batch file does exactly what you want. Just set the desired output file size in the `partSize` variable.

    @echo off
    setlocal EnableDelayedExpansion

    rem Size threshold in bytes at which a new part file is started;
    rem it is checked only after a whole block has been copied.
    set /A partSize=30, part=101, last=0

    del part*.txt 2> NUL
    echo Creating part # %part:~1%
    < all.asc (
    for /F "delims=:" %%n in ('findstr /N /B "=======" all.asc') do (
       set /A "lines=%%n-last, last=%%n"
       (for /L %%i in (1,1,!lines!) do (
          set "line="
          set /P "line="
          echo(!line!
       )) >> part!part:~1!.txt
       for %%f in (part!part:~1!.txt) do (
          if %%~Zf gtr %partSize% (
             set /A part+=1
             echo Creating part # !part:~1!
          )
       )
    ))
Aacini
  • Could you get and report the time this program takes to process your 50 MB file? – Aacini Feb 23 '23 at 03:47
  • It took 4 minutes to complete on my Windows 11 Dell PC. 49 files were created, the smallest was 784K and the largest was 1.41MB. – user2574126 Feb 23 '23 at 21:11
  • The part size was 990000. – user2574126 Feb 24 '23 at 17:40
  • May I ask you a favor? Test [this program](https://stackoverflow.com/a/1002749/778560) with your 50 MB file and report the time. Yes, I know that it will not correctly keep the blocks of lines, but I just want to compare the timing... – Aacini Feb 24 '23 at 18:56