Say I have a big XML dictionary formatted like so:

<entry>
<!-- arbitrary amount of lines -->
<head>SomeWord</head>
<!-- arbitrary amount of lines -->
</entry>

And assume I know that SomeWord is on line 3,026,138. I would like to search backwards from line 3,026,138 up until <entry>, but I don't know how many lines there are between <entry> and my target line.

This answer works properly if I use the line number rather than a pattern, as follows:

sed '/<entry>/h;//!H;3026138!d;x;q' file

However, this is suboptimal: as far as I can tell, sed scans from the first line and crawls through all 3 million lines before it reaches the target. That seems wasteful, since I already know which area of the file I want to work in. All in all, it takes about half a second.

Does anyone have a solution that capitalizes on the fact that I know the line number, and that uses only standard Unix/sh programs that everyone already has (such as grep, awk, sed, and so on)?

Note: please do not suggest I use something like xmllint. Not only is it extremely slow, but I'd also like this to be a meta-format-agnostic script.

RavinderSingh13
Sam G
  • Do not operate on (variable length) line numbers. Operate on file position (number of bytes from file begin), – AnFi Feb 28 '20 at 04:48

2 Answers

1

The problem with tools like sed is that they process one line at a time, whereas you want to treat a big chunk of the file as a whole. Enter ed. The following prints everything from the nearest line containing <entry> before line 3026138 through line 3026138 itself:

echo "3026138;?<entry>?,.p" | ed -s file

(This sets the current line to 3026138, searches backwards from there for the closest line matching <entry>, and prints the range from that match to the current line. If you want to save the chunk to another file, use w foo.txt instead of p.)

Example using your sample file and line 3:

$ echo "3;?<entry>?,.p" | ed -s input.txt
<entry>
<!-- arbitrary amount of lines -->
<head>SomeWord</head>
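
If ed is not installed, roughly the same chunk can be produced with awk. This is only a sketch of a swapped-in technique, not the ed approach above: awk still reads from the top of the file, but it stops at the target line. The function name entry_before_line is a hypothetical helper, and if no <entry> precedes the target line it simply prints everything from line 1.

```shell
# entry_before_line LINE FILE   (hypothetical helper name)
# Print from the nearest line containing <entry> at or before LINE
# through LINE itself.
entry_before_line() {
  awk -v n="$1" '
    /<entry>/ { buf = "" }               # a new entry starts: drop the old chunk
              { buf = buf $0 ORS }       # remember every line since that <entry>
    NR == n   { printf "%s", buf; exit } # target line reached: print and stop
  ' "$2"
}
```

It would be called as entry_before_line 3026138 file.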
Shawn
  • thank you! I hadn't really thought about the use of line editors before and I'm not sure why. – Sam G Mar 01 '20 at 03:10
0

Here is what I tried:

  1. Save the line numbers of all entry tags into a separate file
  2. Specify the line number of the desired head tag
  3. Search the saved line numbers to find where that line fits

Input file:

someline
someline
<entry>
someline
someline
<head>Here</head>
someline
</entry>
someline
<entry>
someline
<head>Another</head>
someline
someline
someline
</entry>
someline
someline

Shell script. It could be split in two so that the search part takes the line number as an argument ($1). That would let you run multiple searches against the same index file, or find the desired tag some other way and then pass its line number to the search script.

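# preparation before doing searches
 ln=12 # line number of the desired <head>
 # Index the line number of every <entry>. The sentinel <entry> appended after
 # the last line (GNU sed one-line 'a' syntax) gives the final entry an upper bound.
 sed '$a<entry>' input.txt | grep -n '^<entry>' | cut -d ':' -f1 > entryl.txt
# doing searches
 t=0 # becomes 1 once an <entry> above line $ln has been seen
 for x in $(seq $(wc -l < entryl.txt)); do
  c=$(sed -n "${x}p" entryl.txt) # line number of the x-th <entry>
  if test "$t" -eq 1; then
   if test "$ln" -lt "$c"; then
    echo "<head> tag on line: $ln"
    echo "Previous <entry> found at: $p"
    echo "Next <entry> found at: $c"
    break
   else
    p=$c # this <entry> is still above $ln; keep looking
   fi
  else
   if test "$ln" -gt "$c"; then
    p=$c; t=1
   fi
  fi
 done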

Sample output:

<head> tag on line: 12
Previous <entry> found at: 10
Next <entry> found at: 19
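
The same "where does it fit" lookup can also be sketched as a single pass, without an intermediate file or a shell loop. This is an illustrative variant under a few assumptions: the function name find_bracket is hypothetical, no sentinel <entry> is appended, so "none" is printed when there is no entry on that side, and an <entry> on the target line itself counts as the previous one.

```shell
# find_bracket LINE FILE   (hypothetical helper name)
# Report the <entry> lines immediately before and after LINE.
find_bracket() {
  grep -n '^<entry>' "$2" | cut -d ':' -f1 |
  awk -v ln="$1" '
    $1 <= ln { prev = $1; next }   # entry at or above the target line
             { nxt = $1; exit }    # first entry below the target: stop reading
    END {
      print "Previous <entry> found at: " (prev == "" ? "none" : prev)
      print "Next <entry> found at: " (nxt == "" ? "none" : nxt)
    }
  '
}
```

It would be called as find_bracket 12 input.txt.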