
I need to replace large swaths of HTML in a 5MB file, and every OS X editor I've tried hangs when I attempt it. sed seems to be the answer, but I can't get the command right. I've been at this for 3 hours, and I'm finally asking for help!

Here's an example - all of this

</div><div class="fsm fwn fcg">Joined<br>Added by **Tiffany Seibel-Howard** on <abbr title="**Thursday, June 20, 2013 at 12:39am**" data-utime="**1371703149**"><span class="timestampContent">**June 20, 2013**</span></abbr></div></div><div class="_4bl7 mrm"></div></div></div></div></div></div></div></td><td class="_51m- vTop hLeft pam _51mw"><div class="_4-u2 _4-u8" data-name="GroupProfileGridItem" data-testid="GroupMember_**100002558935125**"><div class="clearfix"><a class="_8o _8r lfloat _ohe" href="**https://www.facebook.com/brookesblossoms?fref=grp_mmbr_list**" tabindex="-1" aria-hidden="true" data-hovercard="/ajax/hovercard/user.php?**id=100002558935125&amp;extragetparams=%7B%22fref%22%3A%22grp_mmbr_list%22%2C%22directed_target_id%22%3A479810992099587%7D**" data-hovercard-prefer-more-content-show="1"><img class="_s0 _rv img" src="./(2) Neuroblastoma Support group . You are Not Alone Ask Away._files/**10374531_827398764022080_7090816591123160699_n.jpg**" alt=""></a><div class="_8u _42ef"><div class="_6a _5u5j"><div class="_6a _6b" style="height:100px"></div><div class="_6a _5u5j _6b"><div class="fsl fwb fcb">

Needs to be deleted, any time it shows up in the file.

The pieces between ** and ** are wildcards: their values change throughout the file.

Help!

1 Answer


The problem with what you are trying to do is twofold: your text contains regexp metacharacters (e.g. ?) that must be treated as literal, and it also contains literal text that must be converted into regexp metacharacters (e.g. >**June 20, 2013**< -> >[^<]+<). To handle both, start by replacing the changeable parts of your text with uniquely descriptive placeholder strings, e.g.:

</div><div class="fsm fwn fcg">Joined<br>Added by _NOT_LESS_THAN_ on <abbr title="_NOT_DOUBLE_QUOTE_" data-utime="_NOT_DOUBLE_QUOTE_"><span class="timestampContent">_NOT_LESS_THAN_</span></abbr></div></div><div class="_4bl7 mrm"></div></div></div></div></div></div></div></td><td class="_51m- vTop hLeft pam _51mw"><div class="_4-u2 _4-u8" data-name="GroupProfileGridItem" data-testid="GroupMember__NOT_DOUBLE_QUOTE_"><div class="clearfix"><a class="_8o _8r lfloat _ohe" href="_NOT_DOUBLE_QUOTE_" tabindex="-1" aria-hidden="true" data-hovercard="/ajax/hovercard/user.php?_NOT_DOUBLE_QUOTE_" data-hovercard-prefer-more-content-show="1"><img class="_s0 _rv img" src="./(2) Neuroblastoma Support group . You are Not Alone Ask Away._files/_NOT_DOUBLE_QUOTE_" alt=""></a><div class="_8u _42ef"><div class="_6a _5u5j"><div class="_6a _6b" style="height:100px"></div><div class="_6a _5u5j _6b"><div class="fsl fwb fcb">

then sanitize all regexp metacharacters in the text (see "Is it possible to escape regex metacharacters reliably with sed"), then convert the placeholder strings you used above to regexps:

_NOT_LESS_THAN_    -> [^<]+
_NOT_DOUBLE_QUOTE_ -> [^"]+

and then you can run sed -E with the resulting regexp to delete the text.
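The placeholder, escape, and convert steps can be sketched outside of sed as well. The following is a minimal Python illustration, not the sed command itself: the function names (`template_to_regex`, `delete_matches`) and the shortened template are hypothetical, and it assumes the text to be cleaned fits in memory (a 5MB file does). It splits the template on the placeholders, escapes only the literal pieces, and substitutes the regexp fragments from the mapping above.

```python
import re

# Placeholder -> regexp fragment, following the mapping in the answer.
PLACEHOLDERS = {
    "_NOT_LESS_THAN_": "[^<]+",
    "_NOT_DOUBLE_QUOTE_": '[^"]+',
}


def template_to_regex(template):
    """Escape the literal parts of the template and swap placeholders for regexps."""
    # The capture group makes split() keep the placeholders in the result,
    # so literal pieces and placeholders alternate in `parts`.
    splitter = re.compile("(" + "|".join(map(re.escape, PLACEHOLDERS)) + ")")
    parts = splitter.split(template)
    pattern = "".join(
        PLACEHOLDERS[part] if part in PLACEHOLDERS else re.escape(part)
        for part in parts
    )
    return re.compile(pattern)


def delete_matches(template, text):
    """Remove every occurrence of the templated block from text."""
    return template_to_regex(template).sub("", text)


# Shortened, hypothetical stand-in for the full template shown earlier.
template = (
    'Added by _NOT_LESS_THAN_ on <abbr title="_NOT_DOUBLE_QUOTE_" '
    'data-utime="_NOT_DOUBLE_QUOTE_">'
)
sample = 'keep Added by Jane Doe on <abbr title="Thursday" data-utime="123"> this'
print(delete_matches(template, sample))
```

To apply this to a real file, one option would be to read it with `pathlib.Path(...).read_text()`, pass the full placeholder template to `delete_matches`, and write the result to a new file. Escaping the literal pieces separately means a `?` in a URL is matched literally rather than being read as a quantifier.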

You might be better off with GNU awk, though: it lets you set RS to the regexp built from the above text, so each occurrence becomes a record separator and you don't have to read the whole file into memory at once.

Ed Morton