How can I build a regex from a Cyrillic string? I want to use it something like this:

String.replaceAll("Кириллица","")

Of course, it doesn't work. What do I need to do to make it work?

OK, I see that the method works in general, but it doesn't work for me. How can I find out why the replacement has no effect?

...

Hm, I tried s1 = s1.replaceAll("[\\p{InCyrillic}]", ""); on the string I receive through the sockets. It works great: all Cyrillic characters disappear, including the word "Экзамен". But if I try s1 = s1.replaceAll("Экзамен", ""); nothing happens.

However, s1 = s1.replaceAll("Экзамен", ""); did work in the same program for a static string defined in the program itself. I guess the problem may be a wrong charset, but I still can't understand what I am doing wrong. The charset of the string is windows-1251. I tried to experiment with the charset in the program (it is a JSP now), using these calls:

System.setProperty("file.encoding", "windows-1251");
response.setCharacterEncoding("windows-1251"); 

I also tried converting the string from one charset to another. Nothing changes.

Andremoniy
  • What does not work? Can you give an example? There should be no problems. – Henry Jan 15 '13 at 18:02
  • here is an example: I have a string c with cyrillic, that has Экзамен sequence of chars. I do c=c.replaceAll("Экзамен",""); and get a string c=Введение в специальность (Б.3.2.1-ПиКО)60,3Экзамен – user1956641 Jan 15 '13 at 19:06
  • no, the problem is not about tomcat or charset. Can it be so that problem is that i'm doing replace in a long string? – user1956641 Jan 15 '13 at 19:48
  • It should work. If the file is not correctly compiled with correct encoding, or the text input's encoding is incorrect, the replacement will fail. – nhahtdh Jan 16 '13 at 07:49
  • Do you want to replace the sequence "Экзамен", or replace every character in "Экзамен"? – Bohemian Jan 16 '13 at 08:10
  • The sequence. OK, it may be that the encoding is incorrect. I'll try it out – user1956641 Jan 16 '13 at 10:12

2 Answers

It would be clearer if you showed the result you get with the code from @Henry's answer. I suspect the problem is either in the characters themselves or in the encoding. To check whether the string is really Cyrillic, you can use this code:

String s1 = "Экзaмен";
s1 = s1.replaceAll("[\\p{InCyrillic}]", "");
System.out.println(s1);

This removes every Cyrillic character, so whatever is left shows you which characters are not Cyrillic or were decoded incorrectly.

If the result is something like "a", "e", or "ae", your string contains Latin characters that look like Cyrillic ones, so you should replace using this regex:

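 s1 = s1.replaceAll("Экз[\u0430a]м[\u0435e]н", "");

where \u0430 and \u0435 are the Cyrillic letters "а" and "е", and the "a" and "e" next to them are their Latin look-alikes. Add similar pairs for any other letters that can be confused.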
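To see exactly which characters are the Latin look-alikes, it can help to print the code point and Unicode block of every character. The following is only a sketch, assuming Java 8 or later; the class name CodePointDump and its two helper methods are made up for illustration.

```java
class CodePointDump {
    // Returns one line per character, e.g. "U+0061 BASIC_LATIN",
    // so a Latin letter hiding among Cyrillic ones is easy to spot.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int cp : s.codePoints().toArray()) {
            sb.append(String.format("U+%04X %s%n", cp, Character.UnicodeBlock.of(cp)));
        }
        return sb.toString();
    }

    // True if every letter of s belongs to the basic Cyrillic block.
    static boolean isAllCyrillicLetters(String s) {
        return s.codePoints()
                .filter(Character::isLetter)
                .allMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.CYRILLIC);
    }

    public static void main(String[] args) {
        // The test word with a Latin 'a' in fourth position; the Cyrillic
        // letters are written as escapes so the source encoding cannot interfere.
        String mixed = "\u042D\u043A\u0437a\u043C\u0435\u043D";
        System.out.print(describe(mixed));
        System.out.println(isAllCyrillicLetters(mixed)); // false, because of the Latin 'a'
    }
}
```

Any line that reports BASIC_LATIN for something that looks Cyrillic on screen points to a look-alike character.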

If the result is still "Экзaмен", i.e. nothing was removed, the problem is in the encoding, and I hope this link will help you:

How to determine if a String contains invalid encoded characters
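If the encoding is the culprit, the usual place to look is the point where the bytes from the socket become a String. The sketch below is not taken from the question's program: it simulates the incoming bytes, and the class name DecodeDemo and its decode helper are hypothetical. It only illustrates that the same windows-1251 bytes match the pattern when decoded with the right charset and do not match when decoded with a wrong one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    static final Charset CP1251 = Charset.forName("windows-1251");

    // The test word, written with Unicode escapes so the source
    // file encoding does not matter.
    static final String WORD = "\u042D\u043A\u0437\u0430\u043C\u0435\u043D";

    // Decodes raw bytes with an explicitly named charset instead of
    // relying on the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Stand-in for bytes that would arrive over the socket.
        byte[] raw = WORD.getBytes(CP1251);

        String good = decode(raw, CP1251);
        String bad = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(good.replaceAll(WORD, "").isEmpty()); // true: the word was found and removed
        System.out.println(bad.replaceAll(WORD, "").equals(bad)); // true: nothing matched, string unchanged
    }
}
```

In a real program the charset would typically be passed to the reader that wraps the socket's input stream, for example through the InputStreamReader constructor that takes a charset.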

Zhandos

Just tried this:

String s1 = "Введение в специальность (Б.3.2.1-ПиКО)60,3Экзамен";
String s2 = s1.replaceAll("Экзамен", "");
System.out.println(s2);

The output is:

Введение в специальность (Б.3.2.1-ПиКО)60,3
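If the same code behaves differently in another environment, one thing worth ruling out is the encoding of the source file that holds the literal (this is an assumption about the cause, not something confirmed here). A minimal sketch: write the word with Unicode escapes, which do not depend on how the file is saved or compiled. The class name EscapedLiteral and the stripExam helper are made up for illustration.

```java
class EscapedLiteral {
    // The word from the question spelled with Unicode escapes, so the
    // compiler sees the same characters whatever the file encoding is.
    static final String EXAM = "\u042D\u043A\u0437\u0430\u043C\u0435\u043D";

    // Plain-text replacement; String.replace does not treat its
    // argument as a regex, which is enough for a fixed word.
    static String stripExam(String input) {
        return input.replace(EXAM, "");
    }

    public static void main(String[] args) {
        String s1 = "60,3" + EXAM;
        System.out.println(stripExam(s1)); // prints 60,3
    }
}
```

If this version works where the literal version does not, the literal in the source file is probably being read with the wrong encoding.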
Henry
  • Hm, but why do I get a different result then... Maybe it's a problem with the charset, or Tomcat... The strange thing is that the method fails only on Cyrillic. But I don't see the problem. – user1956641 Jan 15 '13 at 19:27