I want to replace the part of a file that matches a regexp. The catch is that the match has to work over the whole file as a single string, like grep -Pzo does, but as far as I know sed is line-based.

I have tried to force sed to do this by manipulating IFS, but I am still inexperienced in bash and I am not really sure about what I'm doing.
I hope you will help me clarify some things that I don't understand.

So I made something like this:

 OIFS=$IFS
 IFS=""
 content=$(cat -v file  | sed 's/(?<=<\/div>(?!.*\/div>)).*//') 
 # Remove everything from the last </div> to the end of the file.
 IFS=$OIFS

But it doesn't work as I intended. I also experimented with perl to make this substitution, but the problem seems to be the same.
I will appreciate any tips.

EDIT: According to comments below I am pasting some example data:

 Source:
    <html>
    <body>
    <div>
        some site with many <div> divs </div>
           <div> and more <div> even more </div> </div>
    </div> <!-- last div closing -->
    This is all to be deleted
    </body>
    </html>

Then after: s/</div>(?<=<\/div>(?!.*\/div>)).*//s

<html>
<body>
<div>
    some site with many <div> divs </div>
       <div> and more <div> even more </div> </div>


EDIT 2: I found an even simpler way than those suggested below:

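cat file | perl -0pe 's/(?<=<\/div>(?!.*\/div>)).*//s'

-0 sets the input record separator to the null character, so perl processes the whole file as one string instead of looping through lines. The s modifier lets . match newlines as well; without it, .* stops at the end of a line.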

Liberat0r

3 Answers

You could do this by reversing your input file, deleting everything until the first </div> and then reversing again:

tac input.txt | sed '1,/<\/div>/d' | tac > output.txt

This will remove the last line that contains a </div>, and everything after it.
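
The reversal approach can be tried out on a small file. This is a minimal sketch, assuming hypothetical sample data in a temporary file (`sample` and `result` are made-up names) and `tac` from GNU coreutils:

```shell
#!/bin/sh
# Hypothetical sample data; the last </div> sits on the third line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div>
  inner <div> text </div>
</div> <!-- last div closing -->
This is all to be deleted
</body>
EOF

# Reverse the lines, delete from line 1 through the first line containing
# </div> (the last such line in the original order), then reverse back.
result=$(tac "$sample" | sed '1,/<\/div>/d' | tac)

printf '%s\n' "$result"
rm -f "$sample"
```

One caveat: in a `1,/re/` range, sed only starts looking for the end pattern on line 2, so if the very last line of the file contains `</div>`, the deletion runs on to the next match. GNU sed accepts `0,/<\/div>/d` to cover that case.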

Alternative with sed (although not pretty, and I'm sure there is a cleverer way to do it):

tr '\n' '~' < input.txt | sed -r 's~(.*)</div>.*~\1~g' | tr '~' '\n' > output.txt

Replace newlines with a placeholder (~ in this example) so that everything is on one line, match that line up to the last </div>, then restore the newlines. Choose the placeholder according to your input data: it must be a character that does not occur in the file.
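
Because the placeholder must not occur in the input, it may be worth checking for it before running the pipeline. This is a minimal sketch, assuming `~` as the placeholder and hypothetical sample data in a temporary file (`sample`, `placeholder` and `result` are made-up names). It uses `-E`, which selects extended regular expressions just as `-r` does in GNU sed.

```shell
#!/bin/sh
# Hypothetical sample data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div>
  inner <div> text </div>
</div> <!-- last div closing -->
This is all to be deleted
EOF

placeholder='~'   # assumption: a single character absent from the input

# Refuse to continue if the placeholder already occurs in the data.
if grep -qF -- "$placeholder" "$sample"; then
    echo "placeholder '$placeholder' occurs in input, pick another" >&2
    exit 1
fi

# Join the lines, cut at the last </div> (which is removed as well),
# then turn the placeholders back into newlines.
result=$(tr '\n' "$placeholder" < "$sample" \
    | sed -E 's#(.*)</div>.*#\1#' \
    | tr "$placeholder" '\n')

printf '%s\n' "$result"
rm -f "$sample"
```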

Josh Jolly
  • That should do. I tried a similar approach, but I was using `rev`, which caused the same problem that I had with `sed`. I didn't know that `tac` existed. Thanks. – Liberat0r Mar 20 '14 at 11:17
  • I am still curious whether I can make my previous code work. This time `tac` is fine, but it is easy to imagine cases where it will not be sufficient. – Liberat0r Mar 20 '14 at 11:29
  • Added an additional way to do it. – Josh Jolly Mar 20 '14 at 11:50

Here is a more general solution:

$ cat file | tr '\n' '\r' | sed 's,\(.*</div>\).*,\1,' | tr '\r' '\n'
<html>
  <body>
    <div>
      some site with many <div> divs </div>
      <div> and more <div> even more </div> </div>
    </div>

Explanation:

tr '\n' '\r' replaces newlines by carriage returns so sed will treat the file content as one line.

sed 's,\(.*</div>\).*,\1,' removes everything beyond the last match of </div>.

tr '\r' '\n' replaces the remaining carriage returns by newlines.
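
With GNU sed, the same substitution can be sketched without the `tr` round trip by using its `-z` option, which splits the input on NUL bytes instead of newlines. This assumes GNU sed and a text file that contains no NUL bytes; `sample` and `result` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical sample data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div>
  inner <div> text </div>
</div> <!-- last div closing -->
This is all to be deleted
EOF

# -z (GNU sed only) makes NUL the record separator, so a text file without
# NUL bytes reaches the pattern space as one record, newlines included.
result=$(sed -z 's,\(.*</div>\).*,\1,' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```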

Note: if your original file contains Windows-style \r\n newlines, convert it to Unix-style newlines first:

$ cat file | dos2unix | tr '\n' '\r' | sed 's,\(.*</div>\).*,\1,' | tr '\r' '\n' | unix2dos
Gerrit Brouwer
  • And to match a "whole line", just enclose the pattern in `\r` instead of ^ and $. Seems a good trick, thanks! :D We just need to make sure there is no \r in the input beforehand; if there is, it is probably a Windows text file, so remove them all first and re-add them afterwards. – Aquarius Power Nov 15 '16 at 01:44

Something like this awk:

awk '/<\/div>/ {exit} 1' file

This exits as soon as the pattern is found, so it prints only the lines before the first line containing </div>.
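
To cut at the last matching line instead of the first, one option is to read the file twice: the first pass records the line number of the last match, and the second pass prints the lines before it. This is a minimal sketch, assuming hypothetical sample data in a temporary file (`sample` and `result` are made-up names). Like the command above, it also drops the matching line itself.

```shell
#!/bin/sh
# Hypothetical sample data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div>
  inner <div> text </div>
</div> <!-- last div closing -->
This is all to be deleted
EOF

# Pass 1 (NR == FNR): remember the number of the last line with </div>.
# Pass 2: print only the lines that come before that line.
result=$(awk 'NR == FNR { if (/<\/div>/) last = FNR; next }
              FNR < last' "$sample" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```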

Jotne
  • This is not exactly what I intended. The goal is to remove everything beginning from the LAST </div> in the file. But thanks. – Liberat0r Mar 20 '14 at 11:15
  • @Liberat0r That is why people ask you to post example data. Updated my answer. – Jotne Mar 20 '14 at 11:17
  • I assumed that anyone who could not understand the given regexp would not be able to help me anyway :P. – Liberat0r Mar 20 '14 at 11:23
  • @Liberat0r If you look, you will see that anubhava, a senior member of the forum, is asking for that, and two others upvoted his reply. So it might not always be easy to understand what is being asked. – Jotne Mar 20 '14 at 11:29