My program must read UTF-8 text files line by line. I am not sure the files are well formed - they may contain unprintable characters. Is it possible to check for this without dropping down to the byte level? Thanks.
-
Do you want to check a single line, or the whole file? – Eran Zimmerman Gonen Sep 14 '11 at 09:08
-
Is it guaranteed, that the line feeds are correct? – Tarnschaf Sep 14 '11 at 09:10
-
check single line. Yes, line feeds are correct. – user710818 Sep 14 '11 at 09:15
-
Do you mean character which cannot be printed in a specific font? There are characters which are undefined in any font. This might be the same thing. – Peter Lawrey Sep 14 '11 at 09:16
8 Answers
Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (available in Java 7 and later):
String line;
try (
    InputStream fis = new FileInputStream("the_file_name");
    InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
    BufferedReader br = new BufferedReader(isr);
) {
    while ((line = br.readLine()) != null) {
        // Deal with the line
    }
}
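Once you have each line, one way to flag suspicious characters is to scan for control characters. This is only a sketch - what counts as "printable" is application-specific, and the class and method names here are illustrative, not from the original answer:

```java
public class PrintableLines {
    // Sketch: treat ISO control characters other than tab as unprintable.
    // The exact criterion is an assumption; adjust it to your needs.
    static boolean isPrintable(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isISOControl(c) && c != '\t') {
                return false;
            }
        }
        return true;
    }
}
```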

-
2Or, for one less step, open the file with a FileReader and use a BufferedReader to read lines. – Warren Dew Apr 28 '14 at 07:27
-
1@stviper: And now it's 2015, I've updated it to use try-with-resources, much cleaner. :-) – T.J. Crowder Jan 07 '15 at 16:21
-
1@abhisheknaik96: Thank you for your edit, but only the `isr` bit was correct; the `()` are **supposed** to be `()`, not `{}`, and the last semicolon isn't required (but it's allowed, so I've left it -- more in keeping with the lines above it). – T.J. Crowder Apr 14 '15 at 07:07
While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.

-
4Guava also offers a method with a callback: Files.readLines(File file, Charset charset, LineProcessor callback) – Vlagorce Aug 21 '12 at 08:13
-
If the purpose is to process line by line, using BufferedReader is just as simple. It is also overkill to add another library dependency just for line reading when the core Java library already supports that. – user172818 Dec 26 '12 at 19:51
-
5@user172818: No, it's not as simple... at least not if you're not using Java 7 with its try-with-resources statement. Additionally, I'd be *amazed* at any non-trivial Java program which couldn't benefit from Guava in *multiple* places. It's a great library, and I wouldn't be without it. – Jon Skeet Dec 26 '12 at 20:45
Just found out that with Java NIO (java.nio.file.*) you can easily write:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for (String line : lines) {
    System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...

-
3It might be worth mentioning the doc for [Files.readAllLines](http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html) : _this method is intended for simple cases where it is convenient to read all lines in a single operation. It is not intended for reading in large files_ – Remi Mélisson Mar 18 '14 at 11:22
-
If you want to check whether a string has unprintable characters, you can use the regular expression
[^\p{Print}]
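A sketch of how that pattern might be used. Note that in Java, \p{Print} is a POSIX character class that is ASCII-only by default; compile with Pattern.UNICODE_CHARACTER_CLASS if you need Unicode-aware semantics. The class and method names below are illustrative:

```java
import java.util.regex.Pattern;

public class PrintableCheck {
    // [^\p{Print}] matches any character outside Java's POSIX "printable"
    // class (visible characters plus space; ASCII-only by default).
    private static final Pattern NON_PRINTABLE = Pattern.compile("[^\\p{Print}]");

    static boolean hasUnprintable(String s) {
        return NON_PRINTABLE.matcher(s).find();
    }
}
```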

-
This, however, includes the whitespace and tab characters in your set of non-printing characters while they influence the place of the words in the page. – bernard paulus Sep 06 '13 at 12:34
How about below:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// readLine() returns null when there are no more lines
while ((line = br.readLine()) != null) {
    // reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html

-
Nope - delete this - you are using default encoding - and entering a world of pain. – Mr_and_Mrs_D Jun 17 '14 at 11:38
I can find the following ways to do it:
private static final String fileName = "C:/Input.txt";

public static void main(String[] args) throws IOException {
    try (Stream<String> lines = Files.lines(Paths.get(fileName))) {
        lines.toArray(String[]::new);
    }

    List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
    readAllLines.forEach(s -> System.out.println(s));

    File file = new File(fileName);
    try (Scanner scanner = new Scanner(file, "UTF-8")) {
        while (scanner.hasNext()) {
            System.out.println(scanner.next());
        }
    }
}

The answer by @T.J.Crowder is Java 6 - in Java 7 the valid answer is the one by @McIntosh - though its use of Charset.forName for UTF-8 is discouraged:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
        StandardCharsets.UTF_8);
for (String line : lines) { /* DO */ }
This reminds a lot of the Guava way posted by Skeet above - and of course the same caveats apply. That is, for big files (Java 7):
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {}
}

If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.
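One way to sketch that per-character check, using code points so that characters outside the Basic Multilingual Plane are handled correctly. The printability criterion here is an assumption (defined in Unicode and not a control character) - adjust it to your own definition:

```java
public class CharCheck {
    // Sketch: a code point is treated as printable if it is assigned in
    // Unicode and is not an ISO control character. This criterion is an
    // assumption; tighten or loosen it as your application requires.
    static boolean allPrintable(String line) {
        return line.codePoints().allMatch(cp ->
                Character.isDefined(cp) && !Character.isISOControl(cp));
    }
}
```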
