0

I'm writing a script that processes an HTML document, and I would like to remove two consecutive lines. How does sed handle newlines? I tried

sed 's/<!DOCTYPE.*\n<h1.*/<newstring>/g'

which didn't work. I also tried the statement below, but it removes the whole document, seemingly because it removes all newlines:

sed ':a;N;$!ba;s/<!DOCTYPE.*\n<h1.*\n<b.*/<newstring>/g'

Any ideas? Maybe I should work with awk?

Klausi
  • The second one removes as much text as possible, including newlines, because `.*` is "greedy" (POSIX regexps do not support lazy/non-greedy quantifiers) and `.` matches any character, including newlines, in a POSIX regex. – Wiktor Stribiżew Feb 18 '21 at 10:02
  • Try it with sed -z – Raman Sailopal Feb 18 '21 at 10:02
  • `I'm writing a script which can parse an HTML document` --> using `sed` isn't recommended for this. Use tools like `xmlstarlet` or programming languages which have libraries to parse xml/html. If you must use `sed/awk/perl` and have to match these patterns across entire lines, see https://stackoverflow.com/questions/38972736/how-to-print-lines-between-two-patterns-inclusive-or-exclusive-in-sed-awk-or – Sundeep Feb 18 '21 at 10:11
  • Sundeep, sorry, I can't use xmlstarlet for this task. I need to remove 2-3 lines starting with certain strings. It works well in EMACS but I want to do it in a script. – Klausi Feb 18 '21 at 10:21
  • `I would like to remove two lines` which two lines? `I need to remove 2-3 lines starting with certain strings` so 2 or 3 lines? Starting from which string exactly? – KamilCuk Feb 18 '21 at 11:24
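
A minimal sketch of the `sed -z` suggestion from the comments, assuming GNU sed (`-z` is a GNU extension) and a made-up three-line input. By default sed reads one line at a time, so `\n` never appears in the pattern space unless a command such as `N` pulls in another line; with `-z` the whole input becomes a single record. Using `[^\n]*` in place of `.*` keeps each part of the match on its own line, which avoids the greedy match described above.

```shell
# Hypothetical sample input, invented for this illustration.
input='<!DOCTYPE html>
<h1>Title</h1>
<p>kept</p>'

# -z (GNU sed): records end at NUL, so the whole input is one record.
# [^\n]* stops at the end of a line, unlike the greedy .* in the question.
result=$(printf '%s\n' "$input" |
    sed -z 's/<!DOCTYPE[^\n]*\n<h1[^\n]*\n/<newstring>\n/')
printf '%s\n' "$result"
```

Because the whole file is held in memory as one record, this approach is probably best kept for small documents.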

4 Answers

1

For the simple task of removing two lines if each matches some pattern, all you need to do is:

sed '/<!DOCTYPE.*/{N;/\n<h1.*/d}'

This uses an address matching the first line you want to delete. When the address matches, it executes:

  • Next - append the next line to the current pattern-space (including \n)

Then, it tests a second address against the contents of the second line (following \n). If that matches, it executes:

  • delete - discard current input and start reading next unread line

If d isn't executed, then both lines will print by default and execution will continue as normal.
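
The behaviour described above can be checked with a small sketch. The sample input is invented for illustration, and a `;` has been added before `}` on the assumption that this makes the command acceptable to non-GNU sed implementations as well; otherwise it is the command from this answer.

```shell
# Hypothetical sample input: the second <!DOCTYPE is not followed by <h1.
input='<!DOCTYPE html>
<h1>Title</h1>
<p>kept</p>
<!DOCTYPE html>
<p>also kept</p>'

# On a <!DOCTYPE line, N appends the next line; d fires only if that
# appended line starts with <h1. Otherwise both lines are printed.
result=$(printf '%s\n' "$input" | sed '/<!DOCTYPE.*/{N;/\n<h1.*/d;}')
printf '%s\n' "$result"
```

Only the first pair is removed; the second `<!DOCTYPE` line survives because the line after it does not start with `<h1`.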

To adjust this for three lines, you need only use N again. If you want to pull in multiple lines until some delimiter is reached, you can use a line-pump, which looks something like this:

/<!DOCTYPE.*/{
    :pump
    N
    /some-regex-to-stop-pump/!b pump
    /regex-which-indicates-we-should-delete/d
}
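
As a rough illustration of the line-pump, here is one way the placeholders might be filled in. The stop marker `</header>` and the `class="banner"` test are hypothetical choices made for this sketch, as is the sample input; the input is arranged so that the stop marker is always reached.

```shell
# Hypothetical sample input, invented for this illustration.
input='<p>intro</p>
<!DOCTYPE html>
<h1 class="banner">Old title</h1>
</header>
<p>body</p>'

result=$(printf '%s\n' "$input" | sed '
/<!DOCTYPE/{
    :pump
    N
    /<\/header>/!b pump
    /class="banner"/d
}')
printf '%s\n' "$result"
```

The loop keeps appending lines until `</header>` shows up in the pattern space, and the collected block is deleted only if it contains `class="banner"`.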

However, writing a full XML parser in sed or awk is a Herculean task and you're likely better off using an existing solution.

Qualia
  • You can write specific commands like this to delete as many lines as you want. You just have to use `N` repeatedly. I'll adjust my answer as I realise you wanted to match on subsequent lines... – Qualia Feb 18 '21 at 12:47
0

If an XML parsing tool is definitely not an option, awk may be an option:

awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1' file

When a line containing "<!DOCTYPE" is encountered, set the variable lne to the next line number (NR+1) and skip to the next line. When the current line number equals lne (NR==lne) and the line contains "<h1", skip that line as well. All other lines are printed by the trailing 1.
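
A small sketch of this command on made-up input. One thing it makes visible is that the `<!DOCTYPE` line is skipped even when the following line does not contain `<h1`. The second awk program is a variant written for this illustration, not part of the answer: it holds the `<!DOCTYPE` line back (the variable names `held` and `have` are arbitrary) and drops it only when the pair is complete.

```shell
# Hypothetical sample input: the second <!DOCTYPE is not followed by <h1.
input='<!DOCTYPE html>
<h1>Title</h1>
<p>kept</p>
<!DOCTYPE html>
<p>also kept</p>'

# Command from the answer: every <!DOCTYPE line is skipped unconditionally.
original=$(printf '%s\n' "$input" |
    awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1')

# Variant (an assumption, not from the answer): hold the <!DOCTYPE line back
# and drop it only when the following line contains <h1.
variant=$(printf '%s\n' "$input" | awk '
    have {                                  # a <!DOCTYPE line is on hold
        have = 0
        if (/<h1/) next                     # pair found: drop both lines
        print held                          # no pair: release the held line
    }
    /<!DOCTYPE/ { held = $0; have = 1; next }
    { print }
    END { if (have) print held }            # input ended on a held line
')

printf 'original:\n%s\n\nvariant:\n%s\n' "$original" "$variant"
```

If the document only ever contains `<!DOCTYPE` directly before `<h1`, both versions should behave the same.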

Raman Sailopal
0

For a document like this:

<b>...
<first...
<second...
<third...
<a ...

this awk command works well (a regex-valued RS is an extension found in GNU awk and mawk, not POSIX awk):

awk -v RS='<first[^\n]*\n<second[^\n]*\n<third[^\n]*\n' '{printf "%s", $0}'

That's all.

Klausi
0

This might work for you (GNU sed):

sed 'N;/<!DOCTYPE.*\n<h1.*/d;P;D' file

Append the next line to the pattern space; if the pattern matches the two lines now held there, delete them.

Otherwise, print then delete the first of the two lines and repeat.
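
A short sketch of this sliding two-line window on invented input. The pair deliberately starts on line 2: a plain `N` without `P;D` would group lines as (1,2) and (3,4) and never see lines 2 and 3 together. The `$!` guard on `N` is an addition for this sketch, not part of the answer; it is there so the last line is still printed by sed implementations that quit silently when `N` runs out of input.

```shell
# Hypothetical sample input: the pair to delete starts on an even line.
input='<p>intro</p>
<!DOCTYPE html>
<h1>Title</h1>
<p>body</p>'

# N appends the next line; d removes a matching pair; otherwise P prints
# the first line of the window and D drops it, keeping the second line
# as the start of the next window.
result=$(printf '%s\n' "$input" | sed '$!N;/<!DOCTYPE.*\n<h1.*/d;P;D')
printf '%s\n' "$result"
```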

To replace the two lines with another string, use:

sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D'

potong