
(I apologize for the vague title. If someone has a better wording, please let me know.)

My question is about a task that I keep running into and would like to solve with sed. I currently have a solution, but it is ugly and destroys some of the formatting. I describe both below.

Question

Usually I have to handle a file like this

.
.
<pattern A>
.
.
<pattern B>
.. <pattern B1>
..
.. <pattern B2>
..
.. <pattern B3>
<pattern B>
.
.
<pattern A>
<pattern B>
.
.

I often want to focus on everything between (or outside of) two occurrences of <pattern A>, or to focus on

<pattern B>
.. <pattern B1>
..
.. <pattern B2>
..
.. <pattern B3>
<pattern B>

while ignoring specific occurrences of <pattern B> elsewhere in the file.

Is there any elegant way to do this with sed?

Concrete Example

1.

From the file

<html>
<div>
1st div
</div>
<div>
2nd div
</div>
..

<div>
10th div
</div>
</html>

how to extract

<div>
3rd div
.
.
7th div
</div>

2.

From the file

<html>
.
.
<ol> # the first <ol> in the whole file
.
.
</ol> # the last </ol> in the whole file
.

How to extract

<ol> # the first <ol> in the whole file
.
.
</ol> # the last </ol> in the whole file

What I've tried

My current solution is very ugly and fragile. I simply delete all newlines, turning the whole file into a single line, and then do lots of ugly sed magic. Fortunately, in my case I can usually put the newlines back, but this is definitely not the right way.

Please let me know if I should provide further information. I know it's a somewhat vague question, but that is exactly what I want to know: can sed detect patterns across the whole file like this? Thanks in advance for your help!

Student
  • Please make a specific example, with specific data, and a specific thing to do. I don't mind the vague title, but with all the "pattern A/pattern B" and "between/out-of" and "focus/ignore", I have no idea what you're trying to do. – Amadan Jul 05 '19 at 01:05
  • I've added two concrete examples – Student Jul 05 '19 at 01:18
  • Obligatory [Zalgo link](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). If your files are kind of regular, like you have one tag per line and there's no similar elements to interfere, it may be possible. However, "3rd div" is not found in your first concrete example, so you can't extract it from there. (I did say "concrete example" - don't _dot dot dot_ stuff. The way you posted, we can't see the structure of it, and relationship between the input and output unambiguously. It doesn't have to be your real data, but it should _look like_ it.) – Amadan Jul 05 '19 at 01:25
  • Thank you @Amadan for your suggestion! Should I keep editing..? – Student Jul 05 '19 at 01:28
  • But it would really, _really_ be good to use a proper DOM parser, in whatever language you're familiar with (Ruby, Python, Perl, JavaScript, PHP...). – Amadan Jul 05 '19 at 01:28
  • That's a new terminology to me! I will surely look into it! – Student Jul 05 '19 at 01:30
  • For example, getting the contents of the first `<ol>` is as easy as `ruby -rnokogiri -e 'puts Nokogiri::HTML(STDIN).at_css("ol").to_xml' < test.html` in Ruby (with Nokogiri gem). – Amadan Jul 05 '19 at 01:37
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Jul 05 '19 at 05:53

1 Answer

This might work for you (GNU sed):

sed -nE '/<div>/{H;:a;n;H;/<\/div>/!ba;x;s/^/x/;/^x{3,7}\n/{H;s/^[^\n]*\n//p;g;s///;s/\n.*//;x;s///;b};s/\n.*//;x}' file

This prints only the 3rd through 7th divs in the file. It uses the first line of the hold space as a counter: each time it encounters a div, it appends the div to the hold space, increments the counter (by prepending an `x`), and then decides whether to print that div. The same mechanism can be used to print all divs, using:

sed -nE '/<div>/{H;:a;n;H;/<\/div>/!ba;x;s/^/x/;/^x{1,}\n/{H;s/^[^\n]*\n//p;g;s///;s/\n.*//;x;s///;b};s/\n.*//;x}' file
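
If awk is an option, the same counting idea can be sketched more readably. This is not the sed approach above, only a rough equivalent: it assumes every `<div>` and `</div>` sits on its own line and that divs are never nested, and `print_divs` is a name made up for illustration.

```shell
# Hypothetical helper: print the FIRST-th through LAST-th <div>...</div> blocks.
# Usage: print_divs FIRST LAST FILE
# Assumes one tag per line and no nested divs.
print_divs() {
  awk -v first="$1" -v last="$2" '
    # An opening tag starts a new block: bump the counter and mark we are inside.
    /<div>/   { n++; inside = 1 }

    # Print every line of a block whose number falls in the requested range.
    inside && n >= first + 0 && n <= last + 0

    # A closing tag ends the block (it has already been printed by the rule above).
    /<\/div>/ { inside = 0 }
  ' "$3"
}
```

For example, `print_divs 3 7 file` should select the same blocks as the first sed command above. The counter here is an ordinary number instead of a run of `x` characters in the hold space, which is why the range test can be a plain comparison.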
potong