
I have a CSV file containing some French words (with accents). I want to read this file using Java and convert the accented letters to non-accented letters. For example, é should be read as e. I have tried the following:

CSVReader reader = new CSVReader(new FileReader(file));
String[] line;
while ((line = reader.readNext()) != null) {
    line[0] = Normalizer.normalize(line[0], Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "")
            .replaceAll("[^a-zA-Z0-9:_']", "_");
    System.out.println("LINE[0]: " + line[0]);
}

For example, if the file contains the line "Arts_et_Métiers", the output is "Arts_et_MAtiers": the accented letter is replaced by 'A' rather than 'e'. Am I doing something wrong? Any help would be appreciated.

Thanks.

BajajG
  • Are you sure it's reading the file correctly to start with? You're using `FileReader`, which always uses the platform-default encoding - not generally a good idea. – Jon Skeet Aug 07 '15 at 08:54
  • The code is working for me with that input text. It's probably an encoding problem. – Laurentiu L. Aug 07 '15 at 08:55
  • Further to the above comments re encoding, see [my answer on this topic](http://stackoverflow.com/a/21824010/2071828). And **always make sure to close your resources** - in this case `try-with-resources` would be the correct approach. – Boris the Spider Aug 07 '15 at 08:59
  • you also could try [this](http://stackoverflow.com/a/27789934/3998458) – Alex S. Diaz Aug 07 '15 at 09:00
  • @JonSkeet: FileReader could be a possible problem! Will check that and let you know! :) – BajajG Aug 07 '15 at 09:03
  • @BoristheSpider: This code snippet is just a small part. The complete code ensures that all the resources are closed after use. – BajajG Aug 07 '15 at 09:04
  • @AlexandroSifuentesDíaz: I tried using the StringUtils.stripAccents() from Apache Commons but it doesn't help! – BajajG Aug 07 '15 at 09:08
  • @JonSkeet Do you suggest some encoding scheme that I should use? The original file is in .xlsx format, and I am reading it using a CSVReader. – BajajG Aug 07 '15 at 09:12
  • @Neha: I would suggest using something which is designed to read Excel files... they're generally not just CSV files. – Jon Skeet Aug 07 '15 at 09:15
  • Maybe do `.replace("œ", "oe")` too. Form.NFKD unfortunately does not do that I think. – Joop Eggen Aug 07 '15 at 09:16
  • Maybe the .xlsx is in reality a text file (as CSVReader reads it), just renamed so that Excel opens it. Open it with a programmer's editor such as Notepad++ or jEdit and you can find out its encoding, probably "Cp1252" or "UTF-8". – Joop Eggen Aug 07 '15 at 09:18
  • So sorry, the file is a CSV file, and not .xlsx file – BajajG Aug 07 '15 at 09:38
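
The encoding explanation suggested in the comments above can be checked with a small sketch. It assumes (this is not confirmed in the thread) that the file is stored as UTF-8 and is being decoded with a single-byte platform default such as ISO-8859-1 or Cp1252. The class name `AccentStripDemo` and the helper `clean` are hypothetical names introduced only for this illustration; `clean` repeats the normalization chain from the question.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class AccentStripDemo {

    // Same clean-up chain as in the question: decompose (NFD), drop
    // everything outside ASCII, then replace any remaining unwanted
    // character with an underscore.
    public static String clean(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "")
                .replaceAll("[^a-zA-Z0-9:_']", "_");
    }

    public static void main(String[] args) {
        // Assumption: the file on disk holds these UTF-8 bytes,
        // in which 'é' is the two-byte sequence C3 A9.
        byte[] utf8Bytes = "Arts_et_M\u00e9tiers".getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong single-byte charset turns C3 A9 into the
        // two characters 'Ã' and '©' (Cp1252 maps these two bytes the same
        // way as ISO-8859-1).
        String wrong = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        // Decoding with the charset the bytes were written in.
        String right = new String(utf8Bytes, StandardCharsets.UTF_8);

        // 'Ã' decomposes to 'A' plus a combining tilde; the tilde and '©'
        // are then removed as non-ASCII, leaving a stray 'A'.
        System.out.println(clean(wrong)); // Arts_et_MAtiers
        System.out.println(clean(right)); // Arts_et_Metiers
    }
}
```

If that assumption holds, the normalization code in the question is fine as it stands, and the change would be to the reader: for instance, passing `new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)` to `CSVReader` in place of `new FileReader(file)`, with the charset adjusted to whatever the file really uses.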

0 Answers