
I have a sed command that I want to run on a huge, terrible, ugly HTML file that was created from a Microsoft Word document. All it should do is remove any instance of the string

style='text-align:center; color:blue;
exampleStyle:exampleValue'

The sed command that I am trying to modify is

sed "s/ style='[^']*'//" fileA > fileB

It works great, except that it fails to match whenever there is a newline inside the text to be matched. Is there a modifier for sed, or something else I can do, to force matching of any character, including newlines?

I understand that regexps are terrible at XML and HTML, blah blah blah, but in this case, the string patterns are well-formed in that the style attributes always start with a single quote and end with a single quote. So if I could just solve the newline problem, I could cut down the size of the HTML by over 50% with just that one command.


In the end, it turned out that Sinan Ünür's perl script worked best. It was almost instantaneous, and it reduced the file size from 2.3 MB to 850k. Good ol' Perl...

Benjamin W.
Cory McHugh
  • sed is line based. That's the major stopping point here. If you use the /g regex modifier, there might be a command line option to get it to read the file as a single 'line', but I doubt it (memory issues and the like) – Matthew Scharley Jul 22 '09 at 12:39
  • There's no option (that I know of) for reading a file as a single line. I would use Perl for this. – Dana Jul 22 '09 at 12:42
  • But sed does have means to append new lines into the pattern space and the hold space, so it is possible to do multi-line processing in sed - it is just not pretty. – Beano Jul 22 '09 at 13:05
  • (I merged your answer into the question; if Sinan's reply answered your problem, then click the "tick" to mark it as answered) – Marc Gravell Jul 24 '09 at 13:32

6 Answers


sed processes its input line by line, which means, as I understand it, that what you want is not straightforward in sed.

You could use the following Perl script (untested), though:

#!/usr/bin/perl

use strict;
use warnings;

{
    local $/; # slurp mode
    my $html = <>;
    $html =~ s/ style='[^']*'//g;
    print $html;
}

__END__

A one liner would be:

$ perl -e 'local $/; $_ = <>; s/ style=\047[^\047]*\047//g; print' fileA > fileB
Sinan Ünür

Sed reads the input line by line, so it is not simple to do processing over more than one line... but it is not impossible either: you need to make use of sed branching. The following will work; I have commented it to explain what is going on (not the most readable syntax!):

sed "# if the line matches 'style='', then branch to label, 
     # otherwise process next line
     /style='/b style
     b
     # the line contains 'style', try to do a replace
     : style
     s/ style='[^']*'//
     # if the replace worked, then process next line
     t
     # otherwise append the next line to the pattern space and try again.
     N
     b style
 " fileA > fileB
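
A more compact variant of the same append-and-branch idea is sketched below. It is not the script above, just one possible rearrangement: it keeps appending lines while the pattern space ends inside an unterminated style attribute, guards N with $! so the last line is never lost, and only then substitutes with the g flag. The sample input and the label name a are made up for illustration.

```shell
# Hypothetical sample input: one attribute split across two lines.
input="<p style='text-align:center; color:blue;
exampleStyle:exampleValue'>Hello</p>
<p>World</p>"

# :a         label to loop back to
# / style='[^']*$/   pattern space ends inside an open style attribute
# $!{ N; ba }        if this is not the last line, append the next one and retry
# s/...//g           remove every complete style attribute in the pattern space
result=$(printf '%s\n' "$input" | sed -e ':a' \
    -e "/ style='[^']*\$/{" \
    -e '$!{' \
    -e 'N' \
    -e 'ba' \
    -e '}' \
    -e '}' \
    -e "s/ style='[^']*'//g")

printf '%s\n' "$result"
```

Lines that do not contain an open attribute are never joined, so the rest of the file keeps its line breaks.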
Beano

You could remove all CR/LF using tr, run sed, and then import into an editor that auto-formats.
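
A minimal sketch of that pipeline, assuming you can live with losing every line break in the file (the sample input is made up for illustration):

```shell
# Hypothetical sample input with an attribute split across two lines.
input="<p style='text-align:center; color:blue;
exampleStyle:exampleValue'>Hello</p>
<p>World</p>"

# tr -d '\r\n' deletes every carriage return and line feed, so sed sees
# the whole document as one long line and the original pattern matches.
flattened=$(printf '%s\n' "$input" | tr -d '\r\n' | sed "s/ style='[^']*'//g")

printf '%s\n' "$flattened"
```

Note that the result is a single line, which is why a re-formatting step afterwards is suggested.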

kmarsh

You can try this:

awk '/style/&&/exampleValue/{
    gsub(/style.*exampleValue\047/,"")
}
/style/&&!/exampleValue/{     
    gsub(/style.* /,"")
    f=1        
}
f &&/exampleValue/{  
  gsub(/.*exampleValue\047 /,"")
  f=0
}
1
' file

Output:

# more file
this is a line
    style='text-align:center; color:blue; exampleStyle:exampleValue'
this is a line
blah
blah
style='text-align:center; color:blue;
exampleStyle:exampleValue' blah blah....

# ./test.sh
this is a line

this is a line
blah
blah
blah blah....
ghostdog74
  • This is my vote for the answer. The progression of languages is sed -> awk -> C/C++/Ada. Start at the left and move right until you have enough power to do the job. – T.E.D. Jul 22 '09 at 13:28
  • may not be c/C++/Ada. IMO, maybe Python/Perl/Ruby etc, at least for sysadmin tasks. – ghostdog74 Jul 22 '09 at 13:36

Another way is something like this:

$ cat toreplace.txt 
I want to make \
this into one line

I also want to \
merge this line

$ sed -e 'N;N;s/\\\n//g;P;D;' toreplace.txt 

Output:

I want to make this into one line

I also want to merge this line

N appends the next input line to the pattern space, P prints the pattern space up to the first newline, and D deletes the pattern space up to the first newline and restarts the cycle without reading a new line.

Benjamin W.

Remove XML elements across several lines

My use case was pretty much the same, but I needed to match opening and closing tags from XML elements and remove them completely --including whatever was inside.

<xmlTag whatever="parameter that holds in the tag header">
    <whatever_is_inside/>
    <InWhicheverFormat>
        <AcrossSeveralLines/>
    </InWhicheverFormat>
</xmlTag>

Still, sed works on one single line. What we do here is trick it into appending subsequent lines to the current one, so we can edit as many lines as we like, and then rewrite the output (\n is a legal character you can output with sed to split lines again).

Inspired by the answer from @beano, and by another answer on Unix & Linux Stack Exchange, I built up my working sed "program":

 sed -s --in-place=.back -e '/\(^[ ]*\)<xmlTag/{  # whenever you encounter the xmlTag
       $! {                                       # do
            :begin                                # label to return to
            N;                                    # append next line
            s/\(^[ ]*\)<\(xmlTag\)[^·]\+<\/\2>//; # Attempt substitution (elimination) of pattern
            t end                                 # if substitution succeeds, jump to :end
            b begin                               # unconditional jump to :begin to append yet another line
            :end                                  # label to mark the end
          }
       }'  myxmlfile.xml

Some explanations:

  • I match <xmlTag without closing the > because my XML element contains parameters.
  • What precedes <xmlTag is a very helpful piece of RegExp to match any existing indentation: \(^[ ]*\) so you can later output it with just \1 (even if it was not needed this time).
  • The addition of ; in several places is so that sed will understand that the command (N, s or whichever) ends there and following character(s) are another command.
  • Most of my trouble was finding a RegExp that would match "anything in between". I finally settled on anything but · (i.e. [^·]\+), counting on not having that character in any of the data files. I needed to escape the + because, in GNU sed's basic regular expressions, \+ means "one or more" while a bare + is a literal plus sign.
  • My original files are kept with a .back suffix, just in case something goes wrong (tests may still fail after the modification), and they are easily flagged by version control for removal in bulk.
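
If the opening and closing tags always sit on lines of their own and the element is never nested inside another element of the same name (both are assumptions about the data, not something guaranteed by the program above), a plain address range may be a simpler alternative to the append loop. This is a different technique, shown only as a sketch on made-up sample data:

```shell
# Hypothetical sample document; only <xmlTag>...</xmlTag> should go.
input='<root>
  <keep/>
  <xmlTag whatever="parameter that holds in the tag header">
    <whatever_is_inside/>
  </xmlTag>
  <alsoKeep/>
</root>'

# The range starts at a line whose first non-blank text is <xmlTag and
# ends at the next line containing </xmlTag>; d deletes every line in it.
result=$(printf '%s\n' "$input" | sed '/^[ ]*<xmlTag/,/<\/xmlTag>/d')

printf '%s\n' "$result"
```

Because the range deletes whole lines, anything else sharing a line with either tag would be removed as well, which the substitution-based program avoids.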

I use this kind of sed automation to evolve the .XML files of serialized data that we use to run our unit and integration tests. Whenever our classes change (lose or gain fields), the data have to be updated. I do that with a single find that runs the sed automation on the files that contain the modified class. We hold hundreds of XML data files.

manuelvigarcia