File content:

<page width="595.28000000" height="841.89000000">
<background type="pdf" pageno="61"/>
<layer/>
</page>
<page width="595.28000000" height="841.89000000">
<background type="pdf" pageno="62"/>
<layer/>
</page>
<page width="595.28000000" height="841.89000000">
<background type="pdf" pageno="63"/>
<layer/>

I am trying to replace e.g. pageno="62" with pageno="63", and to renumber the subsequent pages the same way, i.e. 63->64, 64->65. I am using bash to do this. The file is very big (about 930 pages), so running sed once per page is slow. Is there a faster way to do this?

My script:

total=$(grep pageno= "$1" | tail -n1 | cut -d'"' -f4)

from="${2}"
to="${3}"

for i in $(eval "echo {${from}..${total}}")
do
    sed -i "s#pageno=\"${i}\"#pageno=\"${to}_new\"#g" "${1}"
    ((to += 1))
done

The _new suffix prevents two occurrences of the same page number while the loop is running; I will delete it afterwards.

harshit
  • Please [Don't Parse XML/HTML With Regex](https://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, xmllint, ...). – Cyrus Sep 18 '22 at 20:01

1 Answer

Assumptions:

  • all data is nicely formatted as in OP's example (otherwise OP may want to look at a tool specifically designed for processing HTML/XML formatted fields)
  • there is at most one instance of pageno="####" on any line of input
  • awk is an acceptable solution

One awk idea:

awk -v pgno=62 '
# attempt the replacement on the current line; if it succeeds, increment pgno for the next search-and-replace
sub("pageno=\"" pgno "\"","pageno=\"" pgno+1 "\"") { pgno++ }
# print the current line (modified or not)
1
' pages.dat

This generates:

<page width="595.28000000" height="841.89000000">
<background type="pdf" pageno="61"/>
<layer/>
</page>
<page width="595.28000000" height="841.89000000">
<background type="pdf" pageno="63"/>
<layer/>
</page>
<page width="595.28000000" height="841.89000000">
<background type="pdf" pageno="64"/>
<layer/>

This should be relatively fast since it requires just one OS-level call (awk) and a single pass through the input file, whereas the sed loop in the question rewrites the whole file once per page number.
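The question's script takes arbitrary from and to arguments rather than a fixed +1 shift. Below is a minimal sketch of how the same single-pass idea might handle that, under the same assumption of at most one pageno per line, and assuming every page number at or above from should move by the same offset. The variable names from, to and delta, the sample data and the output file shifted.dat are illustrative and not part of the original answer:

```shell
# Small sample file for the demonstration (hypothetical contents)
cat > pages.dat <<'EOF'
<background type="pdf" pageno="61"/>
<background type="pdf" pageno="62"/>
<background type="pdf" pageno="63"/>
EOF

awk -v from=62 -v to=65 '
BEGIN { delta = to - from }               # same offset for every renumbered page
match($0, /pageno="[0-9]+"/) {
    # pageno=" is 8 characters long, so the digits start at RSTART + 8
    n = substr($0, RSTART + 8, RLENGTH - 9) + 0
    if (n >= from)                        # leave earlier pages untouched
        $0 = substr($0, 1, RSTART - 1) "pageno=\"" (n + delta) "\"" substr($0, RSTART + RLENGTH)
}
1                                         # print current line
' pages.dat > shifted.dat
```

Because each number is computed from the value already on its own line, no temporary marker such as _new should be needed, and the result does not depend on the pages appearing in ascending order.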

If the results look good and you're using GNU awk (4.1.0 or later), you can use the -i inplace option to update the file in place ...

awk -i inplace -v pgno=62 '
sub("pageno=\"" pgno "\"","pageno=\"" pgno+1 "\"") { pgno++ }
1
' pages.dat

... otherwise you can write the output to a temp file and then rename/mv accordingly.
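A minimal sketch of the temp-file variant, assuming mktemp is available; the sample data and the pages.dat.XXXXXX template are only for illustration:

```shell
# Small sample file for the demonstration (hypothetical contents)
cat > pages.dat <<'EOF'
<background type="pdf" pageno="61"/>
<background type="pdf" pageno="62"/>
<background type="pdf" pageno="63"/>
EOF

# Create the temp file next to the original so the final mv is a plain rename
tmp=$(mktemp pages.dat.XXXXXX) || exit 1

# Only replace the original if awk finished successfully
awk -v pgno=62 '
sub("pageno=\"" pgno "\"","pageno=\"" pgno+1 "\"") { pgno++ }
1
' pages.dat > "$tmp" && mv "$tmp" pages.dat
```

Note that the renamed file carries the permissions of the temp file (mktemp creates it readable and writable by the owner only), so you may want to chmod it afterwards.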

markp-fuso