I have a file that contains about 5GB worth of text. I would like to break this up into multiple smaller files based on each time a specific string comes up.

What I am trying to do is put together a script, or even a one-liner, that reads through the file looking for <ticket> and </ticket>. Each time it finds them, it should copy those two lines plus the content in between to a new file, creating a new file for every match. What I thought would do this for me is something like:

#!/bin/bash


i=0
for f in BF_File.xml; do
        let i++;
        sed 's#.*<ticket>\(.*\)</ticket>#\1#' "$f" > Smaller_File_"${i}".xml
done

but this just ends up copying the contents of the original file into Smaller_File_1.xml.

Any help would be greatly appreciated!

TheLordRev

4 Answers

You can't parse XML with RegEx. Please use a dedicated XML parser like xidel instead.
With its integrated EXPath File Module you can do this all in one go:

$ xidel -s BF_File.xml -e '
  for $x at $i in //ticket return
  file:write(x"Smaller_File_{$i}.xml",$x)
'

$ xidel -s Smaller_File_1.xml Smaller_File_2.xml Smaller_File_3.xml -e '$raw'
<ticket>foo</ticket>
<ticket>bar</ticket>
<ticket>base</ticket>
Reino

Assuming the <ticket> and </ticket> pairs are complete, meaning neither of them is missing, would you please try the awk script:

awk '
    /<ticket>/ {f = 1; file = sprintf("Smaller_File_%05d.xml", ++c)}
    f {print > file}
    /<\/ticket>/ {f = 0; close(file)}
' BF_File.xml

or a one-liner (less readable):

awk '/<ticket>/ {f = 1; file = sprintf("Smaller_File_%05d.xml", ++c)} f {print > file} /<\/ticket>/ {f = 0; close(file)}' BF_File.xml
  • If a line matches <ticket> set f and open a new file.
  • If f is set, print the line to the file.
  • If a line matches </ticket>, reset f and close the file.

Please adjust the number of digits in %05d according to the expected number of split files.
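
One way to pick that width, as a rough sketch: count the lines containing an opening tag first and use the length of that number. The file name sample.xml and the variables count, width and fmt are made up for this illustration, and it assumes at most one <ticket> per line.

```shell
# Hypothetical small sample standing in for BF_File.xml
printf '%s\n' '<root>' '<ticket>' 'foo' '</ticket>' '<ticket>' 'bar' '</ticket>' '</root>' > sample.xml

# Number of lines containing an opening tag (assumes one <ticket> per line)
count=$(grep -c '<ticket>' sample.xml)

# Digits needed to print that number, e.g. 2 -> 1, 1500 -> 4
width=${#count}

# Same awk logic as above, with the zero-padding width passed in as w
awk -v w="$width" '
    BEGIN {fmt = "Smaller_File_%0" w "d.xml"}
    /<ticket>/ {f = 1; file = sprintf(fmt, ++c)}
    f {print > file}
    /<\/ticket>/ {f = 0; close(file)}
' sample.xml
```

Note that this reads the input twice, which may matter for a 5GB file; simply choosing a generous fixed width avoids the extra pass.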

tshiono

Your code is:

i=0
for f in BF_File.xml; do
        let i++;
        sed 's#.*<ticket>\(.*\)</ticket>#\1#' "$f" > Smaller_File_"${i}".xml
done
  • for f in BF_File.xml only runs once
  • only a single assignment to i will happen
  • sed regexes are greedy: .* matches the longest possible string (which can include ...</ticket>...<ticket>...)
  • sed operates on individual lines, so the s command will only match if <ticket> and </ticket> appear on the same line

You say the code "copies those two lines plus the content in between" but your sed command would throw away the two lines. You'd need something more like: s#.*\(<ticket>.*</ticket>\)#\1# (but it would still fail for the previous reasons)
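
The last two points can be seen with a small demonstration. This is only a sketch; multi.xml, multi_out.xml and the variable greedy are hypothetical names chosen for the example.

```shell
# Hypothetical sample: the tags sit on separate lines
printf '%s\n' '<ticket>' 'foo' '</ticket>' > multi.xml

# No single line holds both tags, so the s command never matches
# and sed passes every line through unchanged
sed 's#.*<ticket>\(.*\)</ticket>#\1#' multi.xml > multi_out.xml
cmp -s multi.xml multi_out.xml && echo "output is identical to input"

# Two tickets on one line: the leading .* is greedy and swallows the
# first ticket, so only the content of the last one survives
greedy=$(printf '%s\n' '<ticket>a</ticket><ticket>b</ticket>' | sed 's#.*<ticket>\(.*\)</ticket>#\1#')
echo "$greedy"
```

The first case is what happened with the original file: nothing matched, so the whole input ended up in Smaller_File_1.xml.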


With gawk, you can specify that RS is a regex.

Assuming input has the form:

a<ticket>b</ticket>c<ticket>d</ticket>e<ticket>f</ticket>g...

then setting RS='</?ticket>' gives records: a b c d e f g ...

from which the odd elements can be discarded to leave: b d f ...

gawk sets RT to the text that actually matched RS when each record is read, so this can be saved and used to wrap the records on output.

gawk -v RS='</?ticket>' '
    !(NR%2) {
        out = "small" (++n) ".xml"
        print rt $0 RT > out
        close(out)
    }
    { rt = RT }
' big.xml
jhnc
  • Note that this code assumes that `<ticket>`/`</ticket>` cannot appear except as delimiters. In actual XML that is not guaranteed. Using an XML parsing tool would be more reliable. – jhnc Apr 18 '23 at 05:45
  • I'll give this a try, always good to have multiple ways to do something. – TheLordRev Apr 18 '23 at 21:55

Using an XML parser, xmllint, and a shell while loop:

cat file.xml
<root>
<ticket>foo</ticket>
<ticket>bar</ticket>
<ticket>base</ticket>
</root>
i=1
while IFS= read -r val; do 
    echo "$val" | tee "Smaller_File_$(printf '%.5d' $i).xml"
    ((i++))
done < <(xmllint --xpath '//ticket/text()' file.xml)
foo
bar
base
ls -1 Smaller_File_0*
Smaller_File_00001.xml
Smaller_File_00002.xml
Smaller_File_00003.xml
Gilles Quénot