5

I need to replace all "&" symbols with "&#38" in my text file, but not the ones that belong to HTML codes such as &amp; or &quot;

I'm currently using row = row.replace("& ", "&#38");

but, as I said, the HTML codes are replaced as well, e.g. &quot;, and I don't want this. Thanks.

PS: I cannot require a space after the & because I need to replace it in strings such as M&M or Ella & David

aneuryzm

5 Answers

4

You could try a regex, e.g.,

row = row.replaceAll("&(?![#a-zA-Z0-9]+;)", "&amp;");

The regex replaces an & only when it is not followed by a run of characters from '#a-zA-Z0-9' ending with ';'
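As a minimal sketch of how this negative lookahead behaves on the strings from the question, assuming the intended replacement is `&amp;` (the class and method names `AmpersandEscaper` and `escapeBareAmpersands` are made up for illustration):

```java
import java.util.regex.Pattern;

public class AmpersandEscaper {
    // '&' that is NOT followed by an entity-like body ('#', letters, digits) ending in ';'
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?![#a-zA-Z0-9]+;)");

    // Hypothetical helper: escapes only ampersands that do not start an entity.
    public static String escapeBareAmpersands(String row) {
        return BARE_AMPERSAND.matcher(row).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // prints: M&amp;M, Ella &amp; David, &quot;hi&quot; &#38;
        System.out.println(escapeBareAmpersands("M&M, Ella & David, &quot;hi&quot; &#38;"));
    }
}
```

Because the lookahead consumes nothing, the text after each ampersand is left untouched; only the ampersand itself is rewritten.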

Johan Sjöberg
  • Sorry, there was an error in my question. The HTML codes do not have a # after the &, but they have a few letters (of varying length) ending with a ; – aneuryzm Feb 24 '11 at 10:03
  • Your regex doesn't work for strings of the form `&#333;`. What you probably need is `row.replaceAll("&(?![#a-zA-Z0-9]+;)", "&amp;");` – adarshr Feb 24 '11 at 10:12
  • @adarshr, that wasn't clear from the question, but in all fairness, you are completely right! I'll update accordingly, thx. – Johan Sjöberg Feb 24 '11 at 10:16
1

There's no general solution, since in your text there may be things like

&amp;

which may mean either a single ampersand or be a malformed way of writing the literal text &amp;, which should be expressed as

&amp;amp;

However, the latter is quite improbable (unless you're escaping some HTML).

So try something like

row = row.replaceAll("&(?!(?:#\\d+|amp|quot|nbsp);)", "&amp;");

Btw., &#38 is missing the final semicolon. Prefer &amp; to using numeric character codes.
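The whitelist idea can be sketched as follows, building the alternation from an array of entity names instead of hard-coding it. The names in `KNOWN_NAMES` and the class name `KnownEntityEscaper` are assumptions for illustration; the list would need to match whatever entities actually occur in your data.

```java
import java.util.regex.Pattern;

public class KnownEntityEscaper {
    // Assumed whitelist; extend it with the entity names your text really uses.
    private static final String[] KNOWN_NAMES = {"amp", "quot", "nbsp", "lt", "gt"};

    // '&' not followed by a decimal reference (#123;) or a whitelisted name plus ';'
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#\\d+|" + String.join("|", KNOWN_NAMES) + ");)");

    public static String escape(String row) {
        return BARE_AMPERSAND.matcher(row).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("AT&T &amp; &nbsp; &#38; &foo;"));
    }
}
```

The trade-off is that anything outside the whitelist, such as `&foo;` here, is treated as a bare ampersand and escaped, so a valid entity missing from the list would be double-escaped.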

maaartinus
0

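The pattern "& " should be "&\\s", since whitespace has a pattern identifier too. Note that this only works with replaceAll, because replace treats its first argument as a literal string rather than a regex.

So the line should read row = row.replaceAll("&\\s", "&#38");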
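The difference between `String.replace` (literal match) and `String.replaceAll` (regex match) matters for a pattern like `&\\s`. A small sketch, in which the class and method names are hypothetical, and the replacement `"&#38; "` adds the trailing semicolon and keeps a space, which is an assumption about the intended output:

```java
public class ReplaceVsReplaceAll {
    // replace() looks for the literal characters '&', '\', 's', so nothing matches here.
    static String withReplace(String row) {
        return row.replace("&\\s", "&#38; ");
    }

    // replaceAll() compiles "&\\s" as a regex, so \s matches the whitespace after '&'.
    static String withReplaceAll(String row) {
        return row.replaceAll("&\\s", "&#38; ");
    }

    public static void main(String[] args) {
        System.out.println(withReplace("Ella & David"));    // prints: Ella & David
        System.out.println(withReplaceAll("Ella & David")); // prints: Ella &#38; David
        System.out.println(withReplaceAll("M&M"));          // prints: M&M
    }
}
```

As the last call shows, a pattern that requires whitespace after the ampersand still leaves M&M unchanged, which is the case the question needs handled.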

MattLBeck
0

Try

String replacedAmpersands = row.replaceAll("&(?!(?:#\\d+|\\p{L}+);)", "&amp;");

This will only replace ampersands that are not followed by #\d+; (hash, numbers, semicolon) or \p{L}+; (letters, semicolon).

Christoffer Hammarström
0

This solution is more involved, but my feeling is that it is foolproof, whereas the regex solutions may not be 100% correct (as per the famous "do not use regex for HTML" Stack Overflow thread).

Using Jsoup:

import org.jsoup.Jsoup;

// Parse the HTML and return only its text content, with entities decoded.
public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

This gives you plain text in which the entities are already decoded, so the ampersands it contains are literal ones.

Then create a Map whose keys are phrases such as M&M and Ella & David, and whose values are the escaped forms M&amp;M and Ella &amp; David.

The final step is to go back to the original HTML text and replace each key of the map with its value.

Edit: you can of course use any HTML parser you like - just wanted to give you a quick example of how easy it is to use one.

Lucas Zamboulis