2

What I want to do...

I have a webview in my android app. I get a huge html content from the server as a string and a search string from the application user(the android phone user). Now I break the search string and create a regex out of it. I want all the html content that matches my regex to be highlighted when I display it into my WebView.

What I tried...

Since it is html, I just want to wrap the regex matched words into a pair of tags with yellow background.

  1. Simple regex and replaceAll on the html Content that i get. Very wrong because it screws and replaces even what is inside the '<' and '>'.
  2. I tried using Matcher and Pattern combo. It is difficult to omit what is inside the tags.
  3. I used JSOUP Parser and it worked!

I traverse the html using NodeTraversor class. I used Matcher and Pattern classes to find and replace matched words with tags as i wanted to do.

But it is very slow. And I basically want to use it on Android and the size of it is like 284kB. I removed some unwanted classes and it is now 201kB but it is still too much for an android device. Additionally, the html content can be really large. I looked into JSoup source as well. It kind of iterates over every single character when it parses. I do not know whether all the parsers do the same but it is definitely slow for large html documents.

Here is my code -

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Highlighter {

    private String regex;
    private String htmlContent;
    Pattern pat;
    Matcher mat;


    public Highlighter(String searchString, String htmlString) {
        regex = buildRegexFromQuery(searchString);
        htmlContent = htmlString;
        pat = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    public String getHighlightedHtml() {

        Document doc = Jsoup.parse(htmlContent);

        final List<TextNode> nodesToChange = new ArrayList<TextNode>();

        NodeTraversor nd  = new NodeTraversor(new NodeVisitor() {

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    String text = textNode.getWholeText();

                    mat = pat.matcher(text);

                    if(mat.find()) {
                        nodesToChange.add(textNode);
                    }
                }
            }

            @Override
            public void head(Node node, int depth) {        
            }
        });

        nd.traverse(doc.body());

        for (TextNode textNode : nodesToChange) {
            Node newNode = buildElementForText(textNode);
            textNode.replaceWith(newNode);
        }
        return doc.toString();
    }

    private static String buildRegexFromQuery(String queryString) {
        String regex = "";
        String queryToConvert = queryString;

        /* Clean up query */

        queryToConvert = queryToConvert.replaceAll("[\\p{Punct}]*", " ");
        queryToConvert = queryToConvert.replaceAll("[\\s]*", " ");

        String[] regexArray = queryString.split(" ");

        regex = "(";
        for(int i = 0; i < regexArray.length - 1; i++) {
            String item = regexArray[i];
            regex += "(\\b)" + item + "(\\b)|";
        }

        regex += "(\\b)" + regexArray[regexArray.length - 1] + "[a-zA-Z0-9]*?(\\b))";
        return regex;
    }

    private Node buildElementForText(TextNode textNode) {
        String text = textNode.getWholeText().trim();

        ArrayList<MatchedWord> matchedWordSet = new ArrayList<MatchedWord>();

        mat = pat.matcher(text);

        while(mat.find()) {
            matchedWordSet.add(new MatchedWord(mat.start(), mat.end()));
        }

        StringBuffer newText = new StringBuffer(text);

        for(int i = matchedWordSet.size() - 1; i >= 0; i-- ) {
            String wordToReplace = newText.substring(matchedWordSet.get(i).start, matchedWordSet.get(i).end);
            wordToReplace = "<b>" + wordToReplace+ "</b>";
            newText = newText.replace(matchedWordSet.get(i).start, matchedWordSet.get(i).end, wordToReplace);       
        }
        return new DataNode(newText.toString(), textNode.baseUri());
    }

    class MatchedWord {
        public int start;
        public int end;

        public MatchedWord(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }
}

Here is how I call it -

htmlString = getHtmlFromServer();
Highlighter hl = new Highlighter("Hello World!", htmlString);
new htmlString = hl.getHighlightedHTML();

I am sure what i'm doing is not the most optimal way. But I can't seem to think of anything else.

I want to - reduce the time it takes to highlight it. - reduce the size of library

Any suggestions?

Sudarshan Bhat
  • 3,772
  • 2
  • 26
  • 53

2 Answers2

2

How about highlighting them using javascript?

You know, everybody love javascript, and you can find example like this blog.

J-16 SDiZ
  • 26,473
  • 4
  • 65
  • 84
0

JTidy and HTMLCleaner are aloso among the best Java HTML Parser.

see Comparison between different Java HTML Parser

and

What are the pros and cons of the leading Java HTML parsers?

Community
  • 1
  • 1
Sunil Kumar Sahoo
  • 53,011
  • 55
  • 178
  • 243