
I need to remove the following substrings from my string:

\n

\uXXXX (X being a digit or a letter)

e.g. "OR\n\nThe Central Site Engineering\u2019s \u201cfrontend\u201d, where developers turn to"

-> "OR The Central Site Engineering frontend , where developers turn to"
I tried using the String method replaceAll, but I don't know how to handle the \uXXXX case, and it didn't work for \n either:

String s = "\\n";  
data=data.replaceAll(s," ");

What would this regex look like in Java?

Thanks for the help.

D.Shefer
  • Can you describe [what have you tried](http://mattgemmell.com/what-have-you-tried/) and how it did not work? Also, your text doesn't look like it should be stripped of these characters; rather, they should be replaced with the characters they represent, like `\n` -> line separator, `\u2019` -> `’`, `\u201c` -> `“`, and so on. – Pshemo Aug 02 '15 at 17:24
  • So maybe you are asking [how you can unescape these characters](http://stackoverflow.com/questions/3537706/howto-unescape-a-java-string-literal-in-java)? – Pshemo Aug 02 '15 at 17:33
  • I need to replace them with whitespace. I don't need them since the text is going to be indexed with Apache Lucene; I only need the words themselves. – D.Shefer Aug 02 '15 at 17:36
  • "*I need to replace them with whitespace*" Based on your example, you want to remove them (replace them with nothing), not replace them with whitespace. But anyway, this is not a hard task, so you must have tried something. Can we see your attempts? – Pshemo Aug 02 '15 at 17:40
  • Dealing with \n: string.replaceAll("\\n", " "); I also tried putting \n in a String variable instead of writing it inline. – D.Shefer Aug 02 '15 at 17:43
  • Post your attempts inside your question. You can do that with the [edit] option. To add code formatting, use the `{}` button in the editor. – Pshemo Aug 02 '15 at 17:44

2 Answers

The problem with string.replaceAll("\\n", " "); is that replaceAll expects a regular expression, and \ is a special character in regex. It is used, for instance, to write predefined character classes like \d (digits), or to escape regex metacharacters like +.

So if you want to match a literal \ with Java's regex, you need to escape it twice:

  • once in regex \\
  • and once in String "\\\\".

like replaceAll("\\\\n"," ").
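
The two escaping levels can be checked with a small sketch like the one below. It assumes the input contains a literal backslash followed by n, as text read from a file would. The class and method names (EscapeLevels, replaceEscapedNewline, replaceRealNewline) are made up for illustration.

```java
final class EscapeLevels {
    // Java literal "\\\\n" is the regex \\n: a literal backslash followed by 'n'.
    static String replaceEscapedNewline(String input) {
        return input.replaceAll("\\\\n", " ");
    }

    // Java literal "\\n" is the regex \n: a real line-feed character.
    static String replaceRealNewline(String input) {
        return input.replaceAll("\\n", " ");
    }

    public static void main(String[] args) {
        // This string holds backslash + n twice, not two line feeds.
        String fromFile = "OR\\n\\nThe Central";
        System.out.println(replaceEscapedNewline(fromFile)); // both escapes become spaces
        System.out.println(replaceRealNewline(fromFile));    // nothing matches, text unchanged
    }
}
```

The second method is what the attempt in the question effectively does, which is why it appeared not to work on this kind of input.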

You can also skip the regex escaping and use the replace method, which treats its argument as literal text, like

replace("\\n"," ")
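
As a rough illustration of the literal variant, the sketch below compares replace with replaceAll combined with Pattern.quote, which builds a regex that matches its argument literally. The class name LiteralReplace and its method names are hypothetical.

```java
import java.util.regex.Pattern;

final class LiteralReplace {
    // String.replace takes plain text, so only the Java-level escaping is needed.
    static String viaReplace(String input) {
        return input.replace("\\n", " ");
    }

    // Pattern.quote wraps the text so the regex engine matches it literally.
    static String viaQuotedRegex(String input) {
        return input.replaceAll(Pattern.quote("\\n"), " ");
    }

    public static void main(String[] args) {
        String fromFile = "first\\nsecond"; // backslash + n, not a line feed
        System.out.println(viaReplace(fromFile));
        System.out.println(viaQuotedRegex(fromFile));
    }
}
```

Both calls should give the same result; replace is simply the shorter way to write it when no regex features are needed.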

Now to remove \uXXXX we can use

replaceAll("\\\\u[0-9a-fA-F]{4}","")
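
A minimal sketch of that pattern in use, assuming the escapes appear as literal text (backslash, u, four hex digits). UnicodeEscapeStripper and strip are illustrative names, and precompiling the pattern is just one possible design choice.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeStripper {
    // Literal backslash, then 'u', then exactly four hexadecimal digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u[0-9a-fA-F]{4}");

    static String strip(String input) {
        return UNICODE_ESCAPE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "Engineering\\u2019s \\u201cfrontend\\u201d";
        System.out.println(strip(text)); // Engineerings frontend
    }
}
```

Note that a digits-only class such as \d would miss escapes like 201c or 201d, which contain hex letters.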


Also remember that Strings are immutable, so a str.replace.. call doesn't change the value of str; it creates a new String. If you want to store that new String in str, you need to use

str = str.replace(..)
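
The difference can be seen in a short sketch like this one, where the method names discardResult and keepResult are invented for the example:

```java
final class ImmutableStrings {
    // Calls replace but throws the result away, so the text comes back unchanged.
    static String discardResult(String str) {
        str.replace("\\n", " ");
        return str;
    }

    // Reassigns the variable to the new String that replace returns.
    static String keepResult(String str) {
        str = str.replace("\\n", " ");
        return str;
    }

    public static void main(String[] args) {
        String data = "OR\\nThe Central"; // backslash + n as literal text
        System.out.println(discardResult(data)); // still contains the escape
        System.out.println(keepResult(data));    // OR The Central
    }
}
```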

So your solution can look like

String text = "\"OR\\n\\nThe Central Site Engineering\\u2019s \\u201cfrontend\\u201d, where developers turn to\"";

text = text.replaceAll("(\\\\n)+"," ")
           .replaceAll("\\\\u[0-9A-Fa-f]{4}", "");
Pshemo
  • Many thanks! I needed the explanation regarding the replaceAll parameter! – D.Shefer Aug 02 '15 at 18:18
  • @D.Shefer You are welcome. But I was able to give you this explanation only because you posted your code attempts. Without them I could only have posted a solution without a proper explanation, which would not have helped you as much. So in the future, always post your code attempts so people can see what you are struggling with and give you the best answers. – Pshemo Aug 02 '15 at 18:21

Best to do this in 2 parts I guess:

String ex = "OR\\n\\nThe Central Site Engineering\\u2019s \\u201cfrontend\\u201d, where developers turn to";
String part1 = ex.replaceAll("\\\\n", " "); // The first \\\\ matches the backslash, n matches the n.
String part2 = part1.replaceAll("\\\\u[0-9a-fA-F]{4}", ""); // \\d alone would miss hex letters such as the c in 201c.
System.out.println(part2);

Try it =)

Roel Strolenberg
  • OK, I was not precise. It seems that the example we see in the question is not a string literal, but text which could, for instance, be read from a file. So `\n` is not a line separator, but a string of two characters, ``\`` and `n`. So your solution works, but only because you let the Java compiler change `\n` into a line separator, which then can be matched by `"\n"` or `"\\n"`. – Pshemo Aug 02 '15 at 17:58
  • The title of this question means the need of using a regex. – Shai Alon May 12 '22 at 06:37