How do I break up a file into multiple files based on what is between two strings

Question

I have a file that contains about 5GB worth of text. I would like to break this up into multiple smaller files based on each time a specific string comes up.

What I am trying to do is put together a script, or even a one-liner, that reads through the file looking for <ticket> and </ticket>, and each time it finds them copies those two lines plus the content in between to a new file. Each match needs to go into its own new file. What I thought would do this for me is something like:

#!/bin/bash


i=0
for f in BF_File.xml; do
        let i++;
        sed 's#.*<ticket>\(.*\)</ticket>#\1#' "$f" > Smaller_File_"${i}".xml
done

but this just ends up copying the contents of the original file into a Smaller_File_1.xml

Any help would be greatly appreciated!

Please show us **sample input** and expected output. – Gilles Quénot Apr 18 '23 at 01:05

score 1 · Answer 1 · answered Apr 22 '23 at 21:07

You can't parse XML with RegEx. Please use a dedicated XML-parser like xidel instead.
With its integrated EXPath File Module you can do this all in one go:

$ xidel -s BF_File.xml -e '
  for $x at $i in //ticket return
  file:write(x"Smaller_File_{$i}.xml",$x)
'

$ xidel -s Smaller_File_1.xml Smaller_File_2.xml Smaller_File_3.xml -e '$raw'
<ticket>foo</ticket>
<ticket>bar</ticket>
<ticket>base</ticket>

score 0 · Accepted Answer · answered Apr 18 '23 at 01:47

Assuming the <ticket> and </ticket> pairs are complete, meaning neither of them is missing, would you please try the awk script:

awk '
    /<ticket>/ {f = 1; file = sprintf("Smaller_File_%05d.xml", ++c)}
    f {print > file}
    /<\/ticket>/ {f = 0; close(file)}
' BF_File.xml

or a one-liner (less readable):

awk '/<ticket>/ {f = 1; file = sprintf("Smaller_File_%05d.xml", ++c)} f {print > file} /<\/ticket>/ {f = 0; close(file)}' BF_File.xml

If a line matches <ticket> set f and open a new file.
If f is set, print the line to the file.
If a line matches </ticket>, reset f and close the file.

Please adjust the number of digits in %05d according to the expected number of split files.
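As a rough sketch of how the digit count could be chosen automatically, the script below counts the opening tags first and builds the format string from that. The file name sample_tickets.xml and its contents are made up for illustration, and the count is only right if every <ticket> sits on its own line, because grep -c counts matching lines rather than matches.

```shell
#!/bin/bash
# Work in a scratch directory so the demo does not touch real files.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample input standing in for BF_File.xml.
cat > sample_tickets.xml <<'EOF'
<root>
<ticket>
foo
</ticket>
<ticket>
bar
</ticket>
</root>
EOF

# Number of lines that open a ticket; its digit count becomes the pad width.
count=$(grep -c '<ticket>' sample_tickets.xml)
width=${#count}

awk -v w="$width" '
    BEGIN        { fmt = "Smaller_File_%0" w "d.xml" }   # format built from the width
    /<ticket>/   { f = 1; file = sprintf(fmt, ++c) }     # start a new output file
    f            { print > file }                        # copy lines while inside a ticket
    /<\/ticket>/ { f = 0; close(file) }                  # done with this ticket
' sample_tickets.xml

ls Smaller_File_*.xml
```

For a 5GB input the extra grep pass reads the whole file once more, so a fixed, generous width such as %05d or %07d may be the simpler choice.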

jhnc · Answer 3 · 2023-04-18T05:55:21.773

Your code is:

i=0
for f in BF_File.xml; do
        let i++;
        sed 's#.*<ticket>\(.*\)</ticket>#\1#' "$f" > Smaller_File_"${i}".xml
done

for f in BF_File.xml only runs once
only a single assignment to i will happen
sed regex are greedy: .* matches the longest possible string (which can include ...</ticket>...<ticket>...)
sed operates on individual lines, so the s command will only match if <ticket> and </ticket> appear on the same line

You say the code "copies those two lines plus the content in between" but your sed command would throw away the two lines. You'd need something more like: s#.*\(<ticket>.*</ticket>\)#\1# (but it would still fail for the previous reasons)
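The greedy and line-by-line behaviour can be checked on two small made-up inputs. This is only a sketch for illustration; the sample strings are not from the real file.

```shell
#!/bin/bash
# The substitution from the question, kept in a variable so both runs use the same script.
sed_script='s#.*<ticket>\(.*\)</ticket>#\1#'

# Two tickets on one line: the leading .* swallows everything up to the
# last <ticket>, so only the final ticket body (plus trailing text) survives.
one_line=$(printf 'x<ticket>a</ticket>y<ticket>b</ticket>z\n' | sed "$sed_script")
echo "$one_line"

# A ticket spread over three lines: no single line contains both tags,
# so the s command never matches and every line is printed unchanged.
multi_line=$(printf '<ticket>\nfoo\n</ticket>\n' | sed "$sed_script")
echo "$multi_line"
```

The first run prints bz, and the second prints the three input lines untouched, which is why the original script simply copies the whole file.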

With gawk, you can specify that RS is a regex.

Assuming input has the form:

a<ticket>b</ticket>c<ticket>d</ticket>e<ticket>f</ticket>g...

then setting RS='</?ticket>' gives records: a b c d e f g ...

from which the odd elements can be discarded to leave: b d f ...

gawk sets RT to the text that actually matched RS each time a record is read, so this can be saved and used to wrap the records on output.

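gawk -v RS='</?ticket>' '
    !(NR%2) {                       # even-numbered records are ticket bodies
        out = "small" (++n) ".xml"
        print rt $0 RT > out        # rt = saved opening tag, RT = closing tag
        close(out)
    }
    { rt = RT }                     # remember this terminator for the next record
' big.xml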
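Before writing any files, it may help to print each record next to its RT to see how they line up. A minimal sketch, assuming GNU awk is installed (RT is a gawk extension) and using the one-line sample form from above:

```shell
#!/bin/bash
# Print record number, record text, and the terminator that ended the record.
records=$(printf 'a<ticket>b</ticket>c<ticket>d</ticket>e' |
    gawk -v RS='</?ticket>' '{ printf "%d [%s] RT=[%s]\n", NR, $0, RT }')
echo "$records"
```

The even-numbered records (b, d) are the ticket bodies, each ended by </ticket>, while the RT of the record just before each of them holds the matching <ticket>. The last record has an empty RT because the input ends without a terminator.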

Note that this code assumes that `<ticket>`/`</ticket>` cannot appear except as delimiters. In actual XML that is not guaranteed. Using an XML parsing tool would be more reliable. — jhnc, Apr 18 '23 at 05:45
I'll give this a try, always good to have multiple ways to do something. — TheLordRev, Apr 18 '23 at 21:55

Gilles Quénot · Answer 4 · 2023-04-18T15:01:26.787

0

Using an XML parser (xmllint) and a shell while loop:

cat file.xml
<root>
<ticket>foo</ticket>
<ticket>bar</ticket>
<ticket>base</ticket>
</root>

i=1
while IFS= read -r val; do 
    echo "$val" | tee "Smaller_File_$(printf '%.5d' $i).xml"
    ((i++))
done < <(xmllint --xpath '//ticket/text()' file.xml)
foo
bar
base

ls -1 Smaller_File_0*
Smaller_File_00001.xml
Smaller_File_00002.xml
Smaller_File_00003.xml

edited Apr 18 '23 at 15:01

answered Apr 18 '23 at 14:56

Gilles Quénot


I'll give this a try. Always happy to have multiple ways to do something. – TheLordRev Apr 18 '23 at 21:56
IMHO this is the only correct answer. Don't parse XML with anything other than an XML parser. Usually, the way to say thanks is to upvote/accept answers – Gilles Quénot Apr 18 '23 at 21:59
