I'm debranding a micro-site to use as a portfolio piece. It's built with static HTML, and I need to replace the contents of every non-script tag with lipsum or even scrambled text, but the replacement has to be the same number of characters as the current text to keep the formatting intact. Furthermore, I would much rather do this with a GUI grep editor than write a script, because there may be a few tags whose contents I need to keep.

I used the regex \>([^$]+?)\< to find them (all the scripts start with $, so it skips the script tags), but I can't find any way to count the number of characters matched and replace them with a corresponding number of lipsum or random characters.

Thanks for any help!

NealJMD
  • http://stackoverflow.com/questions/6687305/reliably-parsing-html-elements-using-regex#comment-7912433 `->` you can't reliably replace only text in HTML with regex. Highest-upvoted duplicate of thousands: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Nicole Jul 14 '11 at 17:09

1 Answer

I was able to do this successfully, though I ended up having to use a Java program. It turns out regex is fine because I'm not parsing the whole document, just a few parts. There are a few quirks, but this got the job done.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// StdIn, StdOut and LoremIpsum4J are not part of the JDK; they must be on the classpath.
public class Debrander {

    public static void main(String[] args) {

        // reads in html from StdIn
        String htmlPage = StdIn.readAll();

        // regex matches all content within non-script non-style tags
        // ('.' does not match line breaks, so text spanning several lines is left alone)
        Pattern tagContentRegex = Pattern.compile("\\>(.*?)\\<(?!/script)(?!/style)");
        Matcher myMatcher = tagContentRegex.matcher(htmlPage);

        // different regex that finds any non-whitespace character
        Pattern whiteRegex = Pattern.compile("\\S");

        StringBuffer sb = new StringBuffer();

        LoremIpsum4J loremIpsum = new LoremIpsum4J();
        loremIpsum.setStartWithLoremIpsum(false);

        // loop through all matches
        while (myMatcher.find()) {
            String tagContent = myMatcher.group(1);
            // only replace content that has at least one NON-WHITESPACE character
            if (whiteRegex.matcher(tagContent).find()) {
                int charCount = tagContent.length();

                // ask for exactly as many characters as the original text had
                String[] lipsum = loremIpsum.getBytes(charCount);
                StringBuilder replaceString = new StringBuilder(">");
                for (String piece : lipsum) {
                    replaceString.append(piece);
                }
                replaceString.append("<");

                // quoteReplacement keeps any '$' or '\' in the filler from being
                // interpreted as a group reference or escape
                myMatcher.appendReplacement(sb, Matcher.quoteReplacement(replaceString.toString()));
            }
        }
        myMatcher.appendTail(sb);
        StdOut.println(sb.toString());
    }

}
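
For readers who don't have the lorem ipsum library available, the length-matching step can be sketched with the JDK alone. This is only an illustration of the same regex approach, not the code that was actually run: the class name `FillerSketch`, the `LOREM` string, and the helpers `filler` and `debrand` are all made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FillerSketch {
    // Hypothetical filler source; any text without '<' or '>' would do.
    private static final String LOREM =
        "lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod tempor ";

    private static final Pattern TAG_CONTENT =
        Pattern.compile(">(.*?)<(?!/script)(?!/style)");
    private static final Pattern NON_WHITESPACE = Pattern.compile("\\S");

    // Returns exactly `length` characters of filler, cycling through LOREM.
    static String filler(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(LOREM.charAt(i % LOREM.length()));
        }
        return sb.toString();
    }

    // Replaces each non-blank text run between '>' and '<' with same-length filler.
    static String debrand(String html) {
        Matcher m = TAG_CONTENT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String text = m.group(1);
            if (NON_WHITESPACE.matcher(text).find()) {
                String replacement = ">" + filler(text.length()) + "<";
                m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            }
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(debrand("<p>Acme Corp</p><script>$(x);</script>"));
    }
}
```

Because the filler is cut to the length of the captured group, the output always has the same number of characters as the input, which is the property the question asks for.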
NealJMD
  • That you were able to do it does not make it a good idea. – eykanal Jul 19 '11 at 13:34
  • Sure, it's not ideal, but I don't see where the failing is - it's not as though this has security flaws or it's a script that will live on a website to wear out or slow down. It's a single-use tool that does its job effectively. From a theoretical computer science perspective maybe it's not a good idea, but from a practical standpoint the language tier barrier doesn't seem to affect this particular application. – NealJMD Jul 19 '11 at 15:09
  • The problem is that it is guaranteed to not work all the time. Yes, as a single-use tool it happened to work in your case. However, next time it just as likely won't, while using a DOM parser will work every time, and is pretty easy to do. – eykanal Jul 19 '11 at 15:15
  • Great stuff, I'm going to see if I can make something similar :) – Luke Nov 13 '17 at 23:10
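
The DOM-parser alternative mentioned in the comments could look roughly like the sketch below. It makes a strong assumption: the page is well-formed XHTML with no external DTD, so the JDK's XML parser can read it. Real-world HTML usually needs a lenient HTML parser such as jsoup instead. The class `DomDebrander` and its `LOREM`, `filler`, `scramble` and `debrand` members are hypothetical names for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomDebrander {
    // Hypothetical filler source, cycled to reach the requested length.
    private static final String LOREM = "lorem ipsum dolor sit amet ";

    static String filler(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(LOREM.charAt(i % LOREM.length()));
        }
        return sb.toString();
    }

    // Walks the tree and swaps each non-blank text node for same-length filler,
    // leaving <script> and <style> subtrees untouched.
    static void scramble(Node node) {
        short type = node.getNodeType();
        if (type == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        } else if (type == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            if (!text.trim().isEmpty()) {
                node.setNodeValue(filler(text.length()));
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            scramble(children.item(i));
        }
    }

    // Parses well-formed XHTML, scrambles its text, and serializes it again.
    static String debrand(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            scramble(doc.getDocumentElement());

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            debrand("<div><p>Acme Corp</p><script>var a = 1;</script></div>"));
    }
}
```

Tags whose contents must be kept could be handled the same way as `script` and `style`, by adding their names to the check in `scramble`.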