
I've written a Java application that parses a text file to extract fields that are loaded into a data table. We're running into exception processing because the table can't accept special characters, specifically  and the like.

These characters look like spaces when I view the input file, but Java interprets them differently. I suspect a character code is being interpreted differently.

My question is this: in order to filter out these characters, is there any way I can generate a list of what Java is seeing? I'm thinking of printing the char and its character code, and if possible the character *set* (ASCII, ANSI, UTF-8, etc.). From that, I could substitute a space for the character in my output file and solve my problem.

Is there a simpler solution I'm not seeing?
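A minimal sketch of the diagnostic described above, printing each character with its code point and Unicode name so "invisible" characters become visible (the sample string is an assumption, standing in for a line read from the file):

```java
// Diagnostic sketch: dump every character of a line with its Unicode
// code point and name, so characters that merely look like spaces
// (e.g. U+00A0 NO-BREAK SPACE) stand out.
public class CharDump {
    public static void main(String[] args) {
        String line = "foo\u00A0bar";   // illustrative; \u00A0 looks like a space
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            System.out.printf("index %d: '%c' U+%04X %s%n",
                    i, c, (int) c, Character.getName(c));
        }
    }
}
```

Note that Java cannot report which charset the file was written in; it only sees the `char` values produced by whatever charset the file was decoded with when it was read.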

dwwilson66
    Correct solution will be to enable your database to handle such characters. – Jayan May 09 '12 at 13:00
    Don't filter the 'special' characters. Keep them and learn to handle character encodings properly, end-to-end. – artbristol May 09 '12 at 13:00
    Are you certain you read in UTF-8 data correctly in the first place? – Thorbjørn Ravn Andersen May 09 '12 at 13:03
  • Unfortunately, @Jayan, that's not an option. The text files are what I was given to work with and I have no idea in which character set(s) they were created. That's the rock and the hard place between which I'm caught, and it severely limits my ability to do this the RIGHT way. :) – dwwilson66 May 09 '12 at 13:14
  • @dwwilson66: how will you convert it back to real data? If you have the input as a file you could guess the encoding - http://jchardet.sourceforge.net/. Well, you can only guess. – Jayan May 09 '12 at 13:43
  • @Jayan - from what I've seen of the data, losing the special characters doesn't lose us anything; as I said in my original post, they appear as spaces in the source, and make sense contextually *as* spaces, so I just wanna dump the characters. Seems the easiest. – dwwilson66 May 09 '12 at 16:38

2 Answers


Try encoding to, say, UTF-8?

import java.nio.charset.StandardCharsets;

// Encode the string as UTF-8 bytes; using StandardCharsets.UTF_8 avoids
// the checked UnsupportedEncodingException of the string-name overload.
public static byte[] stringToByteArray(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
}

Or use some other charset like "ISO-8859-1", convert that byte array back to a string, and try printing it?
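A sketch of why the charset choice matters: the same byte decodes to different characters depending on the charset assumed. Here the byte 0xA0 is NO-BREAK SPACE in ISO-8859-1, but is not a valid single byte in UTF-8 (it decodes to the replacement character U+FFFD):

```java
import java.nio.charset.StandardCharsets;

// Decode the same raw bytes under two charsets and compare the result.
public class DecodeDemo {
    public static void main(String[] args) {
        byte[] raw = { 'a', (byte) 0xA0, 'b' };   // illustrative input bytes
        String latin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String utf8   = new String(raw, StandardCharsets.UTF_8);
        for (char c : latin1.toCharArray())
            System.out.printf("latin1: U+%04X%n", (int) c);
        for (char c : utf8.toCharArray())
            System.out.printf("utf8:   U+%04X%n", (int) c);
    }
}
```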

hallizh

It sounds like you are crossing character sets, or your input files contain some kind of control-character sequence. You should focus your efforts on that side of it and ensure you are working in the proper character set. The only way I can think of to roll up a list of the characters in a file is to loop over the file and collect them.
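That loop-and-collect idea might look like the following sketch; a map is used instead of a plain array so arbitrary code points are handled, and the sample string stands in for the file contents:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: tally every distinct code point in the input so unexpected
// characters stand out. In practice the string would come from the file.
public class CharInventory {
    public static void main(String[] args) {
        String input = "a b\u00A0c a";   // illustrative sample
        Map<Integer, Integer> counts = new TreeMap<>();
        input.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        counts.forEach((cp, n) ->
            System.out.printf("U+%04X x%d (%s)%n", cp, n, Character.getName(cp)));
    }
}
```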

If you really want to strip all that stuff out, check out this thread

Regular expression for excluding special characters

It explains how to whitelist and blacklist characters with a regex.
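A sketch of the whitelist approach from that thread: replace anything outside printable ASCII with a space. The character class here is an assumption about what the table accepts; widen it as needed.

```java
// Whitelist sketch: keep printable ASCII (0x20-0x7E), replace everything
// else with a space. The input string is illustrative.
public class Whitelist {
    public static void main(String[] args) {
        String line = "foo\u00A0bar";
        String cleaned = line.replaceAll("[^\\x20-\\x7E]", " ");
        System.out.println(cleaned);   // foo bar
    }
}
```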

scphantm
  • I was given the files with no idea of the character set. Agreed that it'd be better to allow for spec chars, but given what I have to work with....I planned the array (though mine is bytes line-by-line since that's how I'm parsing my data already), but the link is incredibly helpful for info on how to code the filter. I really like the idea of whitelisting rather than blacklisting. Thanks! – dwwilson66 May 09 '12 at 13:20