1

I am working with a large string that represents an html page and is processed then. What I do is following:

String data = <HTML PAGE CONTENT>;

// remove first/last appostrove
data = data.substring(1, data.length() - 1);
data = StringUtils.replace(data, "\\u003C", "<");
data = StringUtils.replace(data, "\\u003E", ">");
data = StringUtils.replace(data, "\\\"", "\"");
// the head html element is not needed, so I remove it beforehand
data = removeTag(data, "head", true);
// format the data if necessary in utf8 
// => necessary, otherwise I see unwanted characters in my data
data = cleanString(data);

// continue... here I only parse out a list of all relevant tags I'm interested in
// from here on I use a html parser, which is memory efficient...

Problem

For some people I get OOM exceptions, mostly somewhere in between my string process function, so I'm looking to improve them. I appreciate any suggestion that improves my code in memory efficiency (speed is not important!).

Functions

private static String removeTag(String html, String tag, boolean replaceWithEmpty) {
    String regex = "<" + tag + ">.*?</" + tag + ">";
    return StringUtils.replaceAll(html, regex, replaceWithEmpty ? "<" + tag + "></" + tag + ">" : "");
}

private static String cleanString(String s) {
    try {
        // Convert from Unicode to UTF-8
        byte[] utf8 = s.getBytes("UTF-8");
        // Convert from UTF-8 to Unicode
        s = new String(utf8, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        L.e(e);
    }

    return s;
}

StringUtils

public class StringUtils {

    // compile each pattern once only!
    private static HashMap<String, Pattern> COMPILED_PATTERNS = new HashMap<>();

    private static Pattern getPattern(String regex) {
        if (COMPILED_PATTERNS.containsKey(regex)) {
            return COMPILED_PATTERNS.get(regex);
        }
        Pattern p = Pattern.compile(regex);
        COMPILED_PATTERNS.put(regex, p);
        return p;
    }

    public static Matcher match(String regex, String data) {
        Pattern p = getPattern(regex);
        return p.matcher(data);
    }

    public static String replace(final String str, final CharSequence searchChars, CharSequence replaceChars) {
        return str.replace(searchChars, replaceChars);
    }

    public static String replaceAll(final String str, final String regex, String replacement) {
        Pattern p = getPattern(regex);
        return p.matcher(str).replaceAll(replacement);
    }

    public static String findContentBetween(String content, String prefix, String postfix) {
        return findContentBetween(content, prefix, postfix, false);
    }

    public static String findContentBetween(String content, String prefix, String postfix, boolean searchEndFirst) {
        if (content == null || content.length() == 0) {
            return null;
        }

        if (searchEndFirst) {
            int index = content.indexOf(postfix);
            if (index >= 0) {
                int end = -1;
                int start = -1;
                String s;
                while (index >= 0) {
                    s = content.substring(index, index + 1);
                    if (s.equals("?")) {
                        end = index;
                    } else if (s.equals("/")) {
                        start = index + 1;
                    }
                    if (end != -1 && start != -1) {
                        break;
                    }

                    index--;
                }
                if (end > start && end >= 0) {
                    return content.substring(start, end);
                }
            }
        } else {
            int end;
            int start = content.indexOf(prefix);
            if (start > 0) {
                start += prefix.length();
                end = content.indexOf(postfix, start + 1);
                if (end > start) {
                    return content.substring(start, end);
                }
            }
        }
        return null;
    }
}
prom85
  • 16,896
  • 17
  • 122
  • 242

2 Answers2

4

This answer is addressing the issue when working with a general string. There are better solutions if you're working with HTML.

data = data.substring(1, data.length() - 1);
data = StringUtils.replace(data, "\\u003C", "<");
data = StringUtils.replace(data, "\\u003E", ">");
data = StringUtils.replace(data, "\\\"", "\"");

String is immutable, so each of these strings is necessarily creating a new String (or, it's not doing anything). So, if each of these lines is largely leaving the string unchanged, you're basically just making copies of that string.

Instead, accumulate the updated string in a StringBuilder, doing all of the replacements in one go:

StringBuilder sb = new StringBuilder(data.length());
Map<String, String> replacements = Map.of("\\u003C", "<", "\\u003E", ">" /* etc */);
for (int i = 1; i < data.length() - 1; ++i) {
  sb.append(data.charAt(i));

  for (Map.Entry<String, String> entry : replacements.entrySet()) {
    String search = entry.getKey();

    // This is basically checking "endsWith".
    int endIndex = sb.length() - search.length();
    if (endIndex >= 0 && sb.indexOf(search, endIndex) == endIndex) {
      sb.delete(endIndex, sb.length());
      sb.append(entry.getValue());
    }
   }
}
data = sb.toString();

Note that this is memory efficient, like you've asked for; there are ways to make this more time efficient too.

For example, you could compile a Pattern which matches the things you want to replace:

Pattern p = Pattern.compile(
    replacements.keySet()
        .stream()
        .map(Pattern::quote)
        .collect(Collectors.joining("|")));

and then use the Matcher API, which is well-suited to this task:

Matcher m = p.matcher(data);
int prev = 1;
while (m.find()) {
  sb.append(data, prev, m.start());
  sb.append(replacements.get(m.group()));
  prev = m.end();
}
sb.append(data, prev, data.length() - 1);

Ideone demo

If you wanted to extend the Pattern/Matcher approach to cover the head replacement too, you could append "|<head>[\s\S]*?</head>" to the pattern, and then handle it specially in the loop:

if (!m.group().startsWith("<head>")) {
  sb.append(replacements.get(m.group()));
}

But as you start to descend this path of trying to use regex with HTML, you will quickly find its shortcomings...

Andy Turner
  • 137,514
  • 11
  • 162
  • 243
  • that's a good idea, I thought about using `StringBuilder` instead but did not see the thing you mentioned, though the `String.replace` function is quite good already (not like the regex functions...) – prom85 Oct 20 '17 at 07:10
  • @prom85 the thing about using `String.replace` is that you have to do one replacement at a time, creating a new string each time, which is basically the same as you're already doing. If memory is the issue, you need to create as few strings as possible. – Andy Turner Oct 20 '17 at 07:12
  • Actually, I should use your suggestions to clean and prepare the string and from then on do everything with a html parser instead (like replacing the head tag for example), correct? – prom85 Oct 20 '17 at 07:52
1

Regular expressions in combination with large strings is generally not a good idea. Stronger, you shouldn't parse [X]HTML with regex. Especially when the pattern uses capture groups, it must take care of much. In addition, a <div> inside a <div> will break the code.

You could of course grab a StringBuilder, which conserves a part of the memory, but the problem of parsing HTML with regular expressions still exists.


Edit

It is correct that if you apply replacements within large portions of text, many modified copies of the target text are possibly created. However, some of your requirements can be handled by the parser.

  • Removing tags
    You should be able to do something like this:

    Elements selector = docsoup.select("<your tag>");
    for (Element element : selector) {
        element.remove();
    }
    
MC Emperor
  • 22,334
  • 15
  • 80
  • 130
  • After cleaning the input I'm using a parser, namely `JSoup`... But beforehand I must clean the string and make it a valid input string for the parser... – prom85 Oct 20 '17 at 07:20