0

I have over a gigabyte of text that I need to go through and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though that's mostly lists) that defines when punctuation should not be separated. Being long and complicated makes it hard to use groups with it, though I wouldn't leave that out as an option since I could make most groups non-capturing (?:).

Question: How can I efficiently replace certain characters that don't match a particular regular expression?

I've looked into using lookaheads or similar, and I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would likely be better than using placeholders though. I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing in one pass" function.

Should I do this line by line instead of operating on the whole text?

String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
    protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);

//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");

// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");

Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex anyway, so it's not a real concern.

I'm rewriting what I considered very inefficient Perl code with my own in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it's too slow to be reasonable (I've never seen it get even close to finishing).

//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();

int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
    int pt = strArray[i];
//          System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
    if (protectedArray[currentPos]==pt) {
        if (currentPos == protectedLen - 1) {
            resultStr += protectedStrs.get(protectedCount);
            protectedCount++;
            currentPos = 0;
        } else {
            currentPos++;
        }
    } else {
        if (currentPos > 0) {
            resultStr += replaceStr.substring(0, currentPos);
            currentPos = 0;
            currentStr = "";
        }
        resultStr += ParseUtils.getSymbol(pt);
    }

}
s = resultStr;

This code may not be the most efficient way to return the protected matches. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?

Joshua M
  • 401
  • 4
  • 9
  • 1
    Have you considered [Matcher.appendReplacement](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-) – Patrick Parker Feb 17 '17 at 19:17
  • "I have over a gigabyte of text" then you don't want to be doing `s = replaceThingsIn(s)` over and over, because that creates a new gigabyte-sized string each time. – Andy Turner Feb 17 '17 at 19:17
  • @PatrickParker It turns out that appendReplacement is what I needed. It took me a bit to realize how it applies. – Joshua M Feb 17 '17 at 21:59
  • @AndyTurner I agree. That's why I narrowed it down to the few replacements that I have (I don't know how to have less). However, it turns out that the replaceAll is actually pretty fast. It appears that it's the other things I was doing in loops (substring, converting unicode to symbols, etc.) that was taking a long time. – Joshua M Feb 17 '17 at 22:15

2 Answers2

1

At first I thought that appendReplacement wasn't what I was looking for, but indeed it was. Since it's replacing the placeholders at the end that slowed things down, all I really needed was a way to dynamically replace matches:

StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
    replaceM.appendReplacement(replacedBuff, "");
    replacedBuff.append(protectedStrs.get(index));
    index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();

Reference: Second answer at this question.

Another option to consider: During the first pass through the String, to find the protected Strings, take the start and end indices of each match, replace the punctuation for everything outside of the match, add the matched String, and then keep going. This takes away the need to write a String with placeholders, and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop, as opposed to using String.replaceAll()). A similar alternative is to add the unprotected substrings together, and then replace them all at the same time. However, the protected strings would then have to be added to the replaced string at the end, so I doubt this would save time.

int currIndex = 0;
while (protectedM.find()) {
    protectedStrs.add(protectedM.group());
    String substr = s.substring(currIndex,protectedM.start());
    substr = p1.matcher(substr).replaceAll(" $1 ");
    substr = p2.matcher(substr).replaceAll("$1 '$2");
    substr = p3.matcher(substr).replaceAll("$1 ' $2");
    resultStr += substr+protectedM.group();
    currIndex = protectedM.end();
}

Speed comparison for 100,000 lines of text:

  • Original Perl script: 272.960579875 seconds
  • My first attempt: Too long to finish.
  • With appendReplacement(): 14.245160866 seconds
  • Replacing while finding protected: 68.691842962 seconds

Thank you, Java, for not letting me down.

Community
  • 1
  • 1
Joshua M
  • 401
  • 4
  • 9
  • 2
    don't forget to `appendTail` after the loop – Patrick Parker Feb 17 '17 at 22:05
  • Does it speed up if you use `StringBuilder` instead of `StringBuffer`? That call to `appendReplacement` might confound the escape analysis that determines synchronization isn't necessary; so it could be slower. – Andy Turner Feb 17 '17 at 22:21
  • @AndyTurner I would try, but appendReplacement doesn't work with StringBuilder. StringBuilder might speed up the other substring option though. – Joshua M Feb 17 '17 at 22:27
  • @JoshuaMathias oh well, roll on [Java 9](http://download.java.net/java/jdk9/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuilder-java.lang.String-). – Andy Turner Feb 17 '17 at 22:30
  • @JoshuaMathias but you would benefit from building `resultStr` in a `StringBuilder`/`StringBuffer`, rather than using string concatenation. – Andy Turner Feb 17 '17 at 22:36
1

I don't know exactly how big your in-between strings are, but I suspect that you can do somewhat better than using Matcher.replaceAll, speed-wise.

You're doing 3 passes across the string, each time creating a new Matcher instance, and then creating a new String; and because you're using + to concatenate the strings, you're creating a new string which is the concatenation of the in-between string and the protected group, and then another string when you concatenate this to the current result. You don't really need all of these extra instances.

Firstly, you should accumulate the resultStr in a StringBuilder, rather than via direct string concatenation. Then you can proceed something like:

StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
  protectedStrs.add(protectedM.group());
  appendInBetween(resultStr, str, current, protectedM.str());
  resultStr.append(protectedM.group());
  currIndex = protectedM.end();
}
resultStr.append(str, currIndex, str.length());

where appendInBetween is a method implementing the equivalent to the replacements, just in a single pass:

void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
  // Pass the whole input string and the bounds, rather than taking a substring.

  // Allocate roughly enough space up-front.
  resultStr.ensureCapacity(resultStr.length() + end - start);

  for (int i = start; i < end; ++i) {
    char c = s.charAt(i);

    // Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
    if (!(Character.isLetter(c)
          || Character.isDigit(c)
          || Character.getType(c) == Character.NON_SPACING_MARK
          || "_\\-<>'".indexOf(c) != -1)) {
      resultStr.append(' ');
      resultStr.append(c);
      resultStr.append(' ');
    } else if (c == '\'' && i > 0 && i + 1 < s.length()) {
      // We have a quote that's not at the beginning or end.
      // Call these 3 characters bcd, where c is the quote.

      char b = s.charAt(i - 1);
      char d = s.charAt(i + 1);

      if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
        // If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
        resultStr.append(' ');
        resultStr.append(c);
      } else if (!Character.isLetter(b) && !Character.isLetter(d)) {
        // If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
        resultStr.append(' ');
        resultStr.append(c);
        resultStr.append(' ');
      } else {
        resultStr.append(c);
      }
    } else {
      // Everything else, just append.
      resultStr.append(c);
    }
  }
}

Ideone demo

Obviously, there is a maintenance cost associated with this code - it is undeniably more verbose. But the advantage of doing it explicitly like this (aside from the fact it is just a single pass) is that you can debug the code like any other - rather than it just being the black box that regexes are.

I'd be interested to know if this works any faster for you!

Andy Turner
  • 137,514
  • 11
  • 162
  • 243
  • Yes! Time comparison: appendInBetween: 6.403139092 seconds appendReplacement: 13-14 seconds – Joshua M Feb 21 '17 at 19:14
  • In the call to appendInBetween I think you mean currIndex instead of "current" and protectedM.start() instead of protectedM.str(). I did get it to work, but I changed s.charAt() to s.codePointAt(), because this is meant to eventually work for a large number of languages. I converted it to char for the '\' comparison, and I converted it to a String for appending. – Joshua M Feb 21 '17 at 19:18
  • One more addition: replace resultStr.append(str, currIndex, str.length()); with appendInBetween(resultStr, str, currIndex, str.length()); Otherwise, the end of the String isn't tokenized with spaces. – Joshua M Feb 21 '17 at 19:56
  • I take back the idea of converting to a String for appending. Use StringBuilder's appendCodePoint(). – Joshua M Feb 21 '17 at 20:57