How to remove end-of-cell special characters from a Word document with Java regex

Question

I am extracting data from tables within a Microsoft Word document (.doc).

The data extracts fine, but at the end of each extracted value (from each cell) there is a non-printable ^G character which is seriously messing with further processing. I can only see this when I paste the console output into my text editor (TextMate).

What's the best way to remove this using regex? Is this a Unicode character? I can't find any reference to ^G non-printable characters. I assume it's an end-of-cell character. To be honest I would rather get rid of all non-printable characters, but at the moment this is the only one that is causing me any problems, so either solution will do.

You can use: `input = input.replaceAll("\\P{Print}", "");` in Java to remove all non-printable characters. — anubhava, Oct 05 '17 at 08:39
See [How can I replace non-printable Unicode characters in Java?](https://stackoverflow.com/questions/6198986/how-can-i-replace-non-printable-unicode-characters-in-java) — Wiktor Stribiżew, Oct 05 '17 at 08:52

score 1 · Accepted Answer · answered Oct 05 '17 at 08:44

> To be honest I would rather get rid of all non-printable characters

You may use `input = input.replaceAll("\\P{Print}", "");` in Java to remove all non-printable characters.

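`\p{Print}` matches printable characters and `\P{Print}` does the reverse by matching everything else. Note that in Java this POSIX class covers only printable ASCII by default, so `\P{Print}` also strips non-ASCII characters such as accented letters; prefixing the pattern with `(?U)` (the `UNICODE_CHARACTER_CLASS` flag) makes it Unicode-aware. Either way, tabs and line breaks are removed too.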
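To make the options concrete, here is a minimal sketch that compares a few ways of cleaning a cell value. It assumes the ^G seen in the editor is the BEL control character (U+0007), which is what caret notation `^G` denotes and which appears to be what the extraction returns as the end-of-cell marker. The class name `CellTextCleaner` and its method names are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class CellTextCleaner {
    // Assumption: the trailing ^G is U+0007 (BEL).
    private static final Pattern CELL_MARK = Pattern.compile("\\u0007");

    // Unicode general category Cc (control characters); includes U+0007,
    // but also tab, CR and LF.
    private static final Pattern CONTROL_CHARS = Pattern.compile("\\p{Cc}");

    // Default POSIX class: ASCII only, so non-ASCII letters are removed as well.
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("\\P{Print}");

    // (?U) enables UNICODE_CHARACTER_CLASS, so non-ASCII letters are kept.
    private static final Pattern NON_PRINTABLE_UNICODE = Pattern.compile("(?U)\\P{Print}");

    static String stripCellMark(String s) {
        return CELL_MARK.matcher(s).replaceAll("");
    }

    static String stripControlChars(String s) {
        return CONTROL_CHARS.matcher(s).replaceAll("");
    }

    static String stripNonPrintableAscii(String s) {
        return NON_PRINTABLE_ASCII.matcher(s).replaceAll("");
    }

    static String stripNonPrintableUnicode(String s) {
        return NON_PRINTABLE_UNICODE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // "Cafe" with an accented e (U+00E9), followed by the cell mark.
        String cell = "Caf\u00e9 42\u0007";

        System.out.println(stripCellMark(cell).length());            // 7: only the BEL is gone
        System.out.println(stripControlChars(cell).length());        // 7: same for this input
        System.out.println(stripNonPrintableAscii(cell));            // Caf 42 (accented e lost)
        System.out.println(stripNonPrintableUnicode(cell).length()); // 7: accented e kept
    }
}
```

If only the end-of-cell marker is a problem, the targeted `\u0007` pattern is the least surprising choice; the broader patterns are worth using only when tabs and line breaks inside cell text can safely be dropped too.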
