How to replace characters using Regex

Question

I received a string from an IBM mainframe like the one below (2-byte graphic characters):

"　;Ａ;Ｂ;Ｃ;Ｄ;Ｅ;Ｆ;Ｇ;Ｈ;Ｉ;Ｊ;Ｋ;Ｌ;Ｍ;Ｎ;Ｏ;Ｐ;Ｑ;Ｒ;Ｓ;Ｔ;Ｕ;Ｖ;Ｗ;Ｘ;Ｙ;Ｚ;ａ;ｂ;ｃ;ｄ;ｅ;ｆ;ｇ;ｈ;ｉ;ｊ;ｋ;ｌ;ｍ;ｎ;ｏ;ｐ;ｑ;ｒ;ｓ;ｔ;ｕ;ｖ;ｗ;ｘ;ｙ;ｚ;０;１;２;３;４;５;６;７;８;９;｀;－;＝;￦;～;！;＠;＃;＄;％;＾;＆;＊;（;）;＿;＋;｜;［;］;｛;｝;：;＂;＇;，;．;／;＜;＞;？;";

and I want to convert these characters to their 1-byte ASCII equivalents.

How can I replace them using java.util.regex.Matcher or String.replaceAll() in Java?

Target characters:

;A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z;a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;0;1;2;3;4;5;6;7;8;9;`;-;=;\;~;!;@;#;$;%;^;&;*;(;);_;+;|;[;];{;};:;";';,;.;/;<;>;?;";

regexes shouldn't be used for character encoding translation. See [Encoding conversion in java](http://stackoverflow.com/questions/229015/encoding-conversion-in-java). — outis, Nov 17 '11 at 09:20
This is not a duplicate of that other question. The OP is talking about actual characters, mostly from the [Halfwidth and Fullwidth Forms block](http://www.fileformat.info/info/unicode/block/halfwidth_and_fullwidth_forms/index.htm), that need to be replaced with ASCII characters. — Alan Moore, Nov 17 '11 at 14:35

score 2 · Answer 1 · answered Nov 17 '11 at 14:27

This is not (as other responders are saying) a character-encoding issue, but regexes are still the wrong tool. If Java had an equivalent of Perl's tr/// operator, that would be the right tool, but you can hand-code it easily enough:

public static String convert(String oldString)
{
  String oldChars = "　ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ０１２３４５６７８９｀－＝￦～！＠＃＄％＾＆＊（）＿＋｜［］｛｝：＂＇，．／＜＞？";
  String newChars = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789`-=\\~!@#$%^&*()_+|[]{}:\"',./<>?";

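  // Pre-size the builder; the result has the same length as the input.
  int len = oldString.length();
  StringBuilder sb = new StringBuilder(len);
  for (int i = 0; i < len; i++)
  {
    char ch = oldString.charAt(i);
    // Look up the character's position in the source table.
    int pos = oldChars.indexOf(ch);
    // Characters not in the table are copied through unchanged.
    sb.append(pos < 0 ? ch : newChars.charAt(pos));
  }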
  return sb.toString();
}

I'm assuming each character in the first string corresponds to the character at the same position in the second string, and that the first character (U+3000, 'IDEOGRAPHIC SPACE') should be converted to an ASCII space (U+0020).
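
The lookup table above can also be expressed arithmetically. In Unicode, the fullwidth forms U+FF01 through U+FF5E are laid out in the same order as ASCII U+0021 through U+007E, so subtracting 0xFEE0 yields the ASCII counterpart. The following is a minimal sketch under that observation, not a drop-in replacement: the class name FullwidthToAscii is hypothetical, and mapping U+FFE6 (FULLWIDTH WON SIGN) to a backslash simply follows the target list in the question rather than any Unicode rule.

```java
final class FullwidthToAscii
{
  // Distance between a fullwidth form (U+FF01..U+FF5E) and its ASCII counterpart (U+0021..U+007E).
  private static final int OFFSET = 0xFEE0;

  static String convert(String input)
  {
    StringBuilder sb = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); i++)
    {
      char ch = input.charAt(i);
      if (ch >= 0xFF01 && ch <= 0xFF5E)
      {
        sb.append((char) (ch - OFFSET));   // fullwidth variant of an ASCII character
      }
      else if (ch == 0x3000)
      {
        sb.append(' ');                    // IDEOGRAPHIC SPACE becomes an ASCII space
      }
      else if (ch == 0xFFE6)
      {
        sb.append('\\');                   // assumption taken from the question's target list
      }
      else
      {
        sb.append(ch);                     // anything else is left unchanged
      }
    }
    return sb.toString();
  }

  public static void main(String[] args)
  {
    // Escapes keep this example independent of the source file encoding.
    String sample = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F\u3000\uFF11\uFF12\uFF13";
    System.out.println(convert(sample)); // prints "Hello 123"
  }
}
```

Compared with the two parallel strings, this version needs no table to keep in sync, but any character outside the contiguous fullwidth range has to be special-cased by hand.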

Be sure to save the source file as UTF-8, and include the -encoding UTF-8 option when you compile it (or tell your IDE to do so).
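
If controlling the compiler's encoding is impractical, another option is to write the non-ASCII characters as Unicode escapes, which the Java compiler translates before it reads the string literal. A minimal sketch, with a hypothetical class name and only the first four table entries shown for brevity:

```java
final class EscapedTable
{
  // U+3000 IDEOGRAPHIC SPACE, then fullwidth A, B, C, all written as escapes.
  static final String OLD_CHARS = "\u3000\uFF21\uFF22\uFF23";
  static final String NEW_CHARS = " ABC";

  static String convert(String s)
  {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++)
    {
      char ch = s.charAt(i);
      int pos = OLD_CHARS.indexOf(ch);
      // Characters not in the table are copied through unchanged.
      sb.append(pos < 0 ? ch : NEW_CHARS.charAt(pos));
    }
    return sb.toString();
  }
}
```

A source file written this way contains only ASCII characters, so the table no longer depends on the -encoding option, at least for ASCII-compatible encodings. The trade-off is that the table is harder to read and check by eye.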

score 0 · Answer 2 · answered Nov 17 '11 at 09:22

I don't think this is about regex; it's about encoding. It should be possible to read the data into a String using the 2-byte encoding and then write it out in any other encoding. Look here for supported encodings.
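
The read-then-write approach described here can be sketched as follows. The actual mainframe encoding is not given in the question, so UTF-16BE stands in as an assumed 2-byte encoding, and the class name is hypothetical. Note that re-encoding only changes the byte representation: a fullwidth letter has no US-ASCII encoding, so String.getBytes substitutes '?' for it instead of the plain letter.

```java
import java.nio.charset.StandardCharsets;

final class TranscodeDemo
{
  static String roundTrip(byte[] raw)
  {
    // Decode the incoming bytes (assumed here to be UTF-16BE, a stand-in for the real encoding).
    String decoded = new String(raw, StandardCharsets.UTF_16BE);
    // Re-encode as US-ASCII; unmappable characters are replaced with '?'.
    byte[] ascii = decoded.getBytes(StandardCharsets.US_ASCII);
    return new String(ascii, StandardCharsets.US_ASCII);
  }

  public static void main(String[] args)
  {
    // FF 21 FF 22 is fullwidth "AB" in UTF-16BE.
    byte[] raw = { (byte) 0xFF, 0x21, (byte) 0xFF, 0x22 };
    System.out.println(roundTrip(raw)); // prints "??"
  }
}
```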

Kai Huppmann

It's hard to test/tell from here without having the original byte stream – Kai Huppmann Nov 17 '11 at 10:37
