using regular expression and reconstructing original string

Question

I have text like this -

This is a test text. <span> with bold </span> and with <span> italic </span> and so on and so forth.

Now, I am using the regex `<[^>]*>` to identify all HTML tags. I then replace each tag with an empty string, so the result looks like this:

This is a test text. with bold and with italic and so on and so forth.
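
For reference, the stripping step described above could be sketched in Java roughly as follows. The names `TagStripper` and `strip` are made up for illustration; only the regex `<[^>]*>` comes from the question. Note that the spaces surrounding each removed tag stay in place, so the stripped text keeps some doubled spaces.

```java
import java.util.regex.Pattern;

// Hypothetical helper; only the regex itself is taken from the question.
final class TagStripper {
  // Matches anything that looks like a tag: '<', a run of non-'>' characters, '>'.
  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  // Returns the input with every tag-like substring replaced by the empty string.
  static String strip(final String html) {
    return TAG.matcher(html).replaceAll("");
  }

  public static void main(final String[] args) {
    final String input = "This is a test text. <span> with bold </span> and with "
        + "<span> italic </span> and so on and so forth.";
    System.out.println(strip(input));
  }
}
```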

In the above text, I want to identify text, say, "italic" and insert special tags around it and then reconstruct the original text. So, the result would be

This is a test text. <span> with bold </span> and with <span> <span class='special'>italic</span> </span> and so on and so forth.

I am writing code that uses matcher.start() and matcher.end() to build a list of all the html tags, and I am thinking about reconstructing the string based on this list. Is there a better way of doing it? How would you solve it?
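
The tag list mentioned above might look something like the sketch below. `TagIndex`, `TagSpan` and `findTags` are hypothetical names, and the regex is the one from earlier in the question; the only library calls used are `Matcher.find()`, `group()`, `start()` and `end()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "list of tags with offsets" idea from the question.
final class TagIndex {
  // One tag found in the original string, with its [start, end) offsets there.
  static final class TagSpan {
    final String text;
    final int start;
    final int end;

    TagSpan(final String text, final int start, final int end) {
      this.text = text;
      this.start = start;
      this.end = end;
    }
  }

  // Collects every tag-like match together with its position in the input.
  static List<TagSpan> findTags(final String html) {
    final List<TagSpan> spans = new ArrayList<>();
    final Matcher m = Pattern.compile("<[^>]*>").matcher(html);
    while (m.find()) {
      spans.add(new TagSpan(m.group(), m.start(), m.end()));
    }
    return spans;
  }

  public static void main(final String[] args) {
    for (final TagSpan t : findTags("and with <span> italic </span> and so on")) {
      System.out.println(t.start + "-" + t.end + " " + t.text);
    }
  }
}
```

Because each entry keeps its offsets in the original string, the tags can later be put back without searching for them again.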

EDIT

The reason for searching the text after stripping the html is that the html interferes with the text I am looking for. For instance, it could look like this:

This is a test text. <span> with bold </span> and with <span> it</span>al<span>ic </span> and so on and so forth.

EDIT2

This is not a duplicate question like it is being suggested. Imagine a scenario, where you need to highlight the html that you see on screen, by doing nothing but adding a simple span with background color of yellow to the text of your choice. Now, imagine that this text is the word italic, but it appears as <span>ita</span>l<span>ic</span>. My question is how would you find that word and then add span around it?

EDIT3 Final edit to simplify the problem statement. I hope this makes it clear. This is the input -

This is a test text with <span>it<span>al<span>ic</span> and etc.

This is the expected output -

This is a test text with <span class='highlight'><span>it<span>al<span>ic</span></span> and etc.
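
One possible reading of this requirement, sketched under some assumptions: search the tag-free text, keep a map from each tag-free character back to its position in the original, and insert the highlight tags into the original at the mapped positions. `OffsetMapHighlighter` and its parameters are hypothetical names. The sketch assumes a literal word rather than a regex, handles only the first occurrence, and places the opening tag before any tags that directly precede the match and the closing tag after any tags that directly follow it, which is what the expected output above suggests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: search the tag-free text, then map the match back to the original.
final class OffsetMapHighlighter {
  static String highlight(final String html, final String word,
                          final String open, final String close) {
    if (word.isEmpty()) {
      return html;
    }
    final StringBuilder plain = new StringBuilder();
    // origIndex[k] = position in html of the k-th character of the tag-free text.
    final int[] origIndex = new int[html.length()];
    final Matcher tags = Pattern.compile("<[^>]*>").matcher(html);
    int cursor = 0;
    while (tags.find()) {
      copyText(html, cursor, tags.start(), plain, origIndex);
      cursor = tags.end(); // skip over the tag itself
    }
    copyText(html, cursor, html.length(), plain, origIndex);

    final int start = plain.indexOf(word);
    if (start < 0) {
      return html; // word does not occur in the visible text
    }
    final int end = start + word.length();
    // Opening tag: right after the previous text character, before intervening tags.
    final int openAt = (start == 0) ? 0 : origIndex[start - 1] + 1;
    // Closing tag: right before the next text character, after intervening tags.
    final int closeAt = (end == plain.length()) ? html.length() : origIndex[end];
    // Insert the later position first so the earlier offset stays valid.
    return new StringBuilder(html).insert(closeAt, close).insert(openAt, open).toString();
  }

  // Appends html[from, to) to plain and records where each character came from.
  private static void copyText(final String html, final int from, final int to,
                               final StringBuilder plain, final int[] origIndex) {
    for (int i = from; i < to; i++) {
      origIndex[plain.length()] = i;
      plain.append(html.charAt(i));
    }
  }

  public static void main(final String[] args) {
    System.out.println(highlight(
        "This is a test text with <span>it<span>al<span>ic</span> and etc.",
        "italic", "<span class='highlight'>", "</span>"));
  }
}
```

Whether wrapping the neighbouring tags like this is the right choice depends on the markup; it is only one way to resolve the ambiguity about where the inserted tags should go.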

How would you identify which text had tags after it's replaced? — shmosel, May 13 '16 at 20:55
from the original text, i know where the tags are present, which offsets. — Jay, May 13 '16 at 20:56
Is there a particular reason to strip the html tags and then re-add them? It would seem more efficient to not remove/replace. — KevinO, May 13 '16 at 20:57
Then why don't you use the original string instead of "reconstructing" the tags? — shmosel, May 13 '16 at 20:57
the challenge is that the html is not as simple as illustrated above, it interferes with text I am looking for. So for instance, it won't be as simple as italic, but it would be italic. As you can see, I need to search for the word italic and with the html present in such irregular manner, I need to always remove it first before searching for it. — Jay, May 13 '16 at 21:00
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Vince, May 13 '16 at 21:11
nope, it's not a duplicate of that question. my context is not about matching, but replacing. — Jay, May 13 '16 at 21:16

score 1 · Accepted Answer · answered May 13 '16 at 22:44

This will do what you're looking for, but it doesn't detect/prevent faulty html generation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlHighlighter {
  // The input with every tag removed; this is the text that gets searched.
  private final String inputWithoutTags;
  // Tags of the original input, in document order.
  private final List<Tag> tags;

  // A piece of markup plus the offset in inputWithoutTags where it belongs.
  private static class Tag {
    private final String text;
    private final int startPos;

    private Tag(final String text, final int startPos) {
      this.text = text;
      this.startPos = startPos;
    }
  }

  // Strips everything matching tagRegex and records each tag's tag-free offset.
  public HtmlHighlighter(final String input, final String tagRegex) {
    final Pattern p = Pattern.compile(tagRegex);
    tags = new ArrayList<>();
    final Matcher m = p.matcher(input);
    StringBuffer sb = new StringBuffer();
    int cursor = 0;              // position in the original input
    int cursorExcludingTags = 0; // same position, not counting tag characters
    while (m.find()) {
      cursorExcludingTags += m.start() - cursor;
      tags.add(new Tag(input.substring(m.start(), m.end()), cursorExcludingTags));
      cursor = m.end();
      m.appendReplacement(sb, ""); // copy the text before the tag, drop the tag
    }
    m.appendTail(sb);
    inputWithoutTags = sb.toString();
  }

  // Wraps every match of regexToFind in openingTag/closingTag and restores the original tags.
  public String highlightText(String regexToFind, String openingTag, String closingTag) {
    final List<Tag> allTags = getAllTags(regexToFind, openingTag, closingTag);
    return combineTags(allTags);
  }

  // Returns the original tags merged with one opening/closing pair per match.
  private List<Tag> getAllTags(final String regexToFind, final String openingTag, final String closingTag) {
    final List<Tag> ret = new ArrayList<>(tags);
    final Pattern p = Pattern.compile(regexToFind);
    final Matcher m = p.matcher(inputWithoutTags);
    while (m.find()) {
      addTag(new Tag(openingTag, m.start()), true, ret);
      addTag(new Tag(closingTag, m.end()), false, ret);
    }
    return ret;
  }

  // Inserts tag in offset order. With beforeIgnored it goes before existing tags
  // at the same offset; otherwise it goes after them.
  private void addTag(final Tag tag, final boolean beforeIgnored, final List<Tag> allTags) {
    for (int i = 0; i < allTags.size(); i++) {
      if (allTags.get(i).startPos >= tag.startPos && beforeIgnored) {
        allTags.add(i, tag);
        return;
      }
      if (allTags.get(i).startPos > tag.startPos) {
        allTags.add(i, tag);
        return;
      }
    }
    allTags.add(allTags.size(), tag);
  }

  // Re-inserts all tags into the tag-free text, back to front, so earlier offsets stay valid.
  private String combineTags(final List<Tag> allTags) {
    final StringBuilder sb = new StringBuilder(inputWithoutTags);
    for (int i = allTags.size() - 1; i >= 0; i--) {
      final Tag tag = allTags.get(i);
      sb.insert(tag.startPos, tag.text);
    }
    return sb.toString();
  }

  public static void main(String... args) {
    final HtmlHighlighter highlighter = new HtmlHighlighter("This is a test text with <span>it<span>al<span>ic</span> and etc.", "\\<.*?\\>");
    System.out.println(highlighter.highlightText("italic", "<span class='highlight'>", "</span>"));
  }
}
```
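
To illustrate the caveat about faulty html with a made-up case: highlighting `world` in `<b>Hello wor</b>ld` by inserting at text offsets would give `<b>Hello <span class='highlight'>wor</b>ld</span>`, where the tags overlap instead of nesting. A rough way to detect that after the fact is a stack-based check like the sketch below. `NestingCheck` and `isWellNested` are hypothetical names, not part of the class above, and the check assumes only plain opening and closing tags (no void or self-closing tags such as `<br>`).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker: reports whether simple opening and closing tags are
// properly nested. Attributes are ignored; void tags are not handled.
final class NestingCheck {
  // Group 1 is "/" for a closing tag, group 2 is the tag name.
  private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

  static boolean isWellNested(final String html) {
    final Deque<String> open = new ArrayDeque<>();
    final Matcher m = TAG.matcher(html);
    while (m.find()) {
      final String name = m.group(2).toLowerCase(Locale.ROOT);
      if (m.group(1).isEmpty()) {
        open.push(name); // opening tag: remember it
      } else if (open.isEmpty() || !open.pop().equals(name)) {
        return false;    // closing tag that does not match the innermost open tag
      }
    }
    return open.isEmpty(); // anything left open is also a nesting error
  }

  public static void main(final String[] args) {
    System.out.println(isWellNested("<b>Hello <span class='highlight'>wor</b>ld</span>"));
    System.out.println(isWellNested("<b>Hello</b> <span class='highlight'>world</span>"));
  }
}
```

A check like this only reports the problem; avoiding it would mean splitting the highlight into several spans at each tag boundary, which the code above does not attempt.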

thanks, I had something similar in mind. I wrote something which figures out where the tags are, where the text is at (offsets). Then calculate how much the text offset changes with replacements of html tags with empty string. I will try this code out. — Jay, May 14 '16 at 08:07
why do you say it doesn't detect/prevent faulty generation? My HTML will be complete, it is never incomplete/broken. — Jay, May 14 '16 at 08:08
suppose you want to highlight `world` in the following string: `Hello world`. `Hello world` isn't valid html — Andreas, May 14 '16 at 18:41
ok, valid point. In my case, it will always be span. I guess, that saves me. — Jay, May 14 '16 at 18:47
your answer is good, I tried it. But it does fail often on my input text. I wrote my own logic to fix this. I am awarding the answer to you - Thank you for attempting it and writing such neat readable code. I am giving up hope right after asking a question on stackoverflow, after asking questions for many years - discouraged by the jokers who keep downvoting without using their brain. It is white knights like you who keep my hope alive :-). Once again, thanks for your help. — Jay, May 14 '16 at 18:50
