How can I use regex to match everything except the data inside a div with a specific style? For example:

<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>

I want regex to match everything except for the data. The best way I can see is to just remove the html markup and combine the files afterwards with vb (I already have the code for vb.)

I'm using regex because I need to extract the data from several hundred files.

Rob
sasdev

1 Answer

Matching everything except the data is probably not a good way to do this; it is easier to match the data div itself and extract it. If:

  • you have access to grep
  • your version of grep supports perl-compatible regex (PCRE)
  • this style of div only wraps your data, not other elements
  • the 'data' div does not contain other divs

Then you can use:

(?s)<div style="float:left; padding-top:5px;">.*?</div>

The important parts of this are:

  • (?s) which activates DOTALL, which means that . will match newlines
  • .*? which matches the contents of the div reluctantly, which means it'll stop at the first </div> it finds.

To use this, you'll need to activate a few grep options:

grep -Pzo "$PATTERN" file

For these:

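  • -P enables PCRE (Perl-compatible regular expressions)
  • -z makes grep split its input on NUL bytes instead of newlines, so a file containing no NUL bytes is treated as a single line (each match is also written out followed by a NUL rather than a newline)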
  • -o prints only the matching parts
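
To see why the reluctant .*? matters, here is a minimal sketch that runs the pattern and a greedy variant against the markup from the question. It assumes GNU grep built with PCRE support; the scratch directory, the file name sample.html and the variable names are only for illustration.

```shell
# Hypothetical scratch file holding the markup from the question.
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
EOF

lazy='(?s)<div style="float:left; padding-top:5px;">.*?</div>'
greedy='(?s)<div style="float:left; padding-top:5px;">.*</div>'

# tr turns the NUL that -z writes after each match into a newline.
lazy_out=$(grep -Pzo "$lazy" "$workdir/sample.html" | tr '\0' '\n')
greedy_out=$(grep -Pzo "$greedy" "$workdir/sample.html" | tr '\0' '\n')

# Count the closing tags inside each match.
lazy_count=$(printf '%s\n' "$lazy_out" | grep -c '</div>')
greedy_count=$(printf '%s\n' "$greedy_out" | grep -c '</div>')

echo "$lazy_count"    # prints 1: the match stops at the data div's own </div>
echo "$greedy_count"  # prints 2: the match runs on to the last </div> in the file
```

The greedy version swallows the trailing divider div as well, which is exactly what the reluctant quantifier avoids.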

After this you'll need to strip off the divs. sed is a good tool for this.

sed 's|</\?div[^>]*>||g'

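If you put all of your files in one directory you can do the joining at the same time. With several input files, add -h so grep does not prefix each match with its filename, and use tr to turn the NUL that -z writes after each match back into a newline:

grep -Pzoh "$PATTERN" *.html | tr '\0' '\n' | sed 's|</\?div[^>]*>||g' > out.html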
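
As a rough end-to-end check, the sketch below builds two small pages in a scratch directory and runs the whole pipeline over them. It assumes GNU grep with PCRE support and GNU sed (for \? in the sed expression); the file names a.html and b.html and their contents are made up for the example. Two details are worth noting: -h stops grep from prefixing each match with its filename when several files are given, and tr converts the NUL terminators that -z produces into newlines.

```shell
# Hypothetical scratch directory with two sample pages.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

for name in a b; do
  cat > "$name.html" <<EOF
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data from $name
</div>
EOF
done

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# Extract the data divs from every page, then strip the div tags themselves.
grep -Pzoh "$PATTERN" *.html \
  | tr '\0' '\n' \
  | sed 's|</\?div[^>]*>||g' > out.html

cat out.html
```

The result still contains the blank lines left where the div tags used to be, so a final pass to drop empty lines may be wanted before combining the data further.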
beerbajay