Regexp flag DOTALL in sed or alternatives

Question

I want to replace part of a file that matches a regexp. The match has to run over the whole file as a single string, the way grep -Pzo does, but as far as I know sed is line-based.

I have tried to force sed to do this by manipulating IFS, but I am still inexperienced in bash and I am not really sure about what I'm doing.
I hope you will help me clarify some things that I don't understand.

So I made something like this:

 OIFS=$IFS
 IFS=""
 content=$(cat -v file  | sed 's/(?<=<\/div>(?!.*\/div>)).*//') 
 #Remove everything beginning from the last </div> to the end of the file.
 IFS=$OIFS

But it doesn't work as I intended. I was also experimenting with perl to make this substitution, but the problem seems to be the same.
I would appreciate any tips.

EDIT: According to comments below I am pasting some example data:

 Source:
    <html>
    <body>
    <div>
        some site with many <div> divs </div>
           <div> and more <div> even more </div> </div>
    </div> <!-- last div closing -->
    This is all to be deleted
    </body>
    </html>

Then after: s/<\/div>(?<=<\/div>(?!.*\/div>)).*//s

<html>
<body>
<div>
    some site with many <div> divs </div>
       <div> and more <div> even more </div> </div>

EDIT 2: I found an even simpler way than the ones suggested below:

cat file | perl -0pe 's/(?<=<\/div>(?!.*\/div>)).*//s'

-0 sets the record separator to the null character, which makes perl process the whole file as a single string (as long as it contains no NUL bytes) instead of looping through lines.

Maybe you should be using `xmlstarlet` because of [this](http://stackoverflow.com/a/1732454/1422630)? – Aquarius Power, Nov 15 '16 at 00:22

Josh Jolly · Answer 1 · 2014-03-20T12:04:19.270

3

You could do this by reversing your input file, deleting everything until the first </div> and then reversing again:

tac input.txt | sed '1,/<\/div>/d' | tac > output.txt

This will remove the last line which contains a </div> and everything after it.
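A small runnable sketch of the `tac` pipeline, assuming GNU coreutils and GNU sed; the file names and contents are invented for the demo. It also shows one edge case worth knowing about: with the address `1,/re/` the end pattern is only tested from line 2 onward, so a </div> on the very last line of the file is skipped. GNU sed's `0,/re/` form tests line 1 as well.

```shell
# Hypothetical sample input, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '<div>' 'a </div>' '</div> <!-- last -->' 'junk' '</body>' > "$tmpdir/input.txt"

# Reverse, delete through the first line containing </div>, reverse back.
tac_result=$(tac "$tmpdir/input.txt" | sed '1,/<\/div>/d' | tac)
printf '%s\n' "$tac_result"

# Edge case: the file's last line itself contains </div>.
printf '%s\n' 'a </div>' 'b </div>' > "$tmpdir/edge.txt"
# With 1,/re/ the end pattern is first tested on line 2, so both lines are deleted.
edge_posix=$(tac "$tmpdir/edge.txt" | sed '1,/<\/div>/d' | tac)
# GNU sed's 0,/re/ also tests line 1, so only the last line is removed.
edge_gnu=$(tac "$tmpdir/edge.txt" | sed '0,/<\/div>/d' | tac)

rm -rf "$tmpdir"
```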

Alternative with sed (although not pretty, and I'm sure there is a cleverer way to do it):

tr '\n' '~' < input.txt | sed -r 's~(.*)</div>.*~\1~g' | tr '~' '\n' > output.txt

Replace newlines with a placeholder (~ in this example) so everything is on one line, match that line up until the last </div>, then turn the placeholders back into newlines. Choose the placeholder to suit your input data: it must be a character that does not occur anywhere in the file.
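One way to guard against a bad placeholder choice is to check for it before running the pipeline. This is only a sketch: the variable name `placeholder` and the sample file are made up, and `-E` is used as the portable spelling of `-r`.

```shell
# Hypothetical sample input, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '<div>' 'a </div>' '</div> <!-- last -->' 'junk' > "$tmpdir/input.txt"

placeholder='~'   # assumed not to occur in the input; verified below

if grep -qF -- "$placeholder" "$tmpdir/input.txt"; then
    # The round trip would turn real ~ characters into newlines, so bail out.
    echo "placeholder '$placeholder' occurs in the input; pick another one" >&2
    tr_result=''
else
    # Join lines, cut from the last </div> onward, then restore the newlines.
    tr_result=$(tr '\n' "$placeholder" < "$tmpdir/input.txt" \
        | sed -E 's~(.*)</div>.*~\1~' \
        | tr "$placeholder" '\n')
fi
printf '%s\n' "$tr_result"

rm -rf "$tmpdir"
```

Because the capture group stops before </div>, this variant removes the last </div> itself as well as everything after it.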

edited Mar 20 '14 at 12:04

answered Mar 20 '14 at 11:09


That should do. I tried a similar approach, but I was using `rev`, which caused the same problem that I had with `sed`. I didn't know that `tac` existed. Thanks. – Liberat0r Mar 20 '14 at 11:17
I am still curious whether I can make my previous code work. This time `tac` is fine, but it is easy to imagine cases where it will not be sufficient. – Liberat0r Mar 20 '14 at 11:29
Added an additional way to do it. – Josh Jolly Mar 20 '14 at 11:50

score 3 · Accepted Answer · answered Mar 20 '14 at 11:57

Here is a more general solution:

$ cat file | tr '\n' '\r' | sed 's,\(.*</div>\).*,\1,' | tr '\r' '\n'
<html>
  <body>
    <div>
      some site with many <div> divs </div>
      <div> and more <div> even more </div> </div>
    </div>

Explanation:

tr '\n' '\r' replaces newlines by carriage returns so sed will treat the file content as one line.

sed 's,\(.*</div>\).*,\1,' removes everything beyond the last match of </div>.

tr '\r' '\n' replaces the remaining carriage returns by newlines.

Note: if your original file contains Windows-style \r\n newlines, first convert to Unix-style newlines:

$ cat file | dos2unix | tr '\n' '\r' | sed 's,\(.*</div>\).*,\1,' | tr '\r' '\n' | unix2dos
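If GNU sed is available, its `-z` (`--null-data`) option may make the `tr` round trip unnecessary: records are then separated by NUL instead of newline, so a text file with no NUL bytes is handled as a single record and `.*` spans line breaks. A minimal sketch, assuming GNU sed and a made-up sample file:

```shell
# Hypothetical sample input, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '<div>' 'a </div>' '</div> <!-- last -->' 'junk' '</body>' > "$tmpdir/file"

# -z: split input on NUL, so the whole file is one "line" for sed.
# The greedy \(.*</div>\) extends to the LAST </div>; the rest is dropped.
# \n in the replacement (GNU sed) restores a final newline.
sedz_result=$(sed -z 's,\(.*</div>\).*,\1\n,' "$tmpdir/file")
printf '%s\n' "$sedz_result"

rm -rf "$tmpdir"
```

This is specific to GNU sed; other sed implementations may not accept `-z`.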

And if we want to match a "whole line", we can enclose the pattern with `\r` instead of ^ and $. Seems a good trick, thanks! We just need to make sure there wasn't any \r in the file to begin with; if there was, it is probably a Windows text file, so remove them all before starting and re-add them afterwards. – Aquarius Power, Nov 15 '16 at 01:44

Jotne · Answer 3 · 2014-03-20T11:18:39.623

0

Something like this with awk:

awk '/<\/div>/ {exit} 1' file

This exits as soon as the pattern is found, so only the lines before the first </div> are printed.
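To cut at the last </div> instead of the first, one possible awk approach is to read the file twice: the first pass records the number of the last matching line, the second prints up to that line. This is only a sketch with a made-up sample file; the variable name `last` is arbitrary.

```shell
# Hypothetical sample input, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '<div>' 'a </div>' '</div> <!-- last -->' 'junk' '</body>' > "$tmpdir/file"

# Pass 1 (NR == FNR): remember the line number of the last </div>.
# Pass 2: print every line up to and including that line.
awk_result=$(awk 'NR == FNR { if (/<\/div>/) last = FNR; next } FNR <= last' \
    "$tmpdir/file" "$tmpdir/file")
printf '%s\n' "$awk_result"

rm -rf "$tmpdir"
```

This works on whole lines, so any text after </div> on that last matching line is kept; if no line matches, nothing is printed.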

edited Mar 20 '14 at 11:18

answered Mar 20 '14 at 11:07


This is not exactly what I intended. The goal is to remove everything beginning from the LAST </div> in the file. But thanks. – Liberat0r Mar 20 '14 at 11:15
@Liberat0r That is why people ask you to post example data. Updated my answer. – Jotne Mar 20 '14 at 11:17
I assumed that someone who is not able to understand the given regexp would not be able to help me anyway :P. – Liberat0r Mar 20 '14 at 11:23
@Liberat0r If you look, you see anubhava, a senior here in the forum is asking for that. And two other upvoted his reply. So it might not always be easy to understand what is asked. – Jotne Mar 20 '14 at 11:29
