29

I am trying to parse the Linux /etc/passwd file in Java. I'm currently reading each line through the java.util.Scanner class and then using java.lang.String.split(String) to delimit each line.

The problem is that the line:

list:x:38:38:Mailing List Manager:/var/list:/bin/sh

is treated by the scanner as 3 different lines:

  1. list:x:38:38:Mailing
  2. List
  3. Manager...

When I type this out into a new file that I didn't get from Linux, Scanner parses it properly.

Is there something I'm not understanding about new lines in Linux?

Obviously a workaround is to parse it without using Scanner, but that wouldn't be elegant. Does anyone know of an elegant way to do it?

Is there a way to convert the file into one that would work with Scanner?


Not even two days ago: Historical reason behind different line ending at different platforms

EDIT

Note from the original author:

"I figured out I have a different error that is causing the problem. Disregard question"

jbu
  • I figured out I have a different error that is causing the problem. Disregard question. – jbu Jan 08 '09 at 23:06
  • can you document the real problem & solution, for completeness? add as an edit to the question (if you have enough rep points) – Kevin Haines Jan 08 '09 at 23:24

5 Answers

60

From Wikipedia:

  • LF: Multics, Unix and Unix-like systems (GNU/Linux, AIX, Xenix, Mac OS X, FreeBSD, etc.), BeOS, Amiga, RISC OS, and others
  • CR+LF: DEC RT-11 and most other early non-Unix, non-IBM OSes, CP/M, MP/M, DOS, OS/2, Microsoft Windows, Symbian OS
  • CR: Commodore machines, Apple II family, Mac OS up to version 9 and OS-9

I translate this into these line endings in general:

  • Windows: '\r\n'
  • Mac (OS 9-): '\r'
  • Mac (OS 10+): '\n'
  • Unix/Linux: '\n'

You need to make your scanner/parser handle the Unix line ending ('\n'), too.
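As a rough illustration of handling all of these conventions at once, the sketch below splits a string on '\r\n', '\n', or '\r'. The class name `LineEndings`, the helper `splitLines`, and the sample data are made up for this example. Note that `Scanner.nextLine()` and `BufferedReader.readLine()` already accept these terminators, so explicit splitting like this is only needed when you hold the whole text as one string.

```java
import java.util.Arrays;
import java.util.List;

public class LineEndings {
    // Hypothetical helper: splits text on any of the conventions listed above.
    // "\r\n" comes first in the alternation so a Windows terminator is
    // consumed as one unit instead of producing an empty line in between.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\r\n|\n|\r"));
    }

    public static void main(String[] args) {
        // Made-up sample mixing Windows, Unix and classic Mac OS endings.
        String mixed = "root:x:0:0\r\ndaemon:x:1:1\nbin:x:2:2\rlist:x:38:38";
        for (String line : splitLines(mixed)) {
            System.out.println(line);
        }
    }
}
```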

Michael Haren
  • thank you, that does explain things, but I figured out I have a different error that is causing the problem. – jbu Jan 08 '09 at 23:05
  • Does Mac still use '\r'? – Michael Myers Jan 08 '09 at 23:15
  • @mmeyers: no, not since OS X. – Greg Hewgill Jan 08 '09 at 23:20
  • I updated the table with the Mac info – Michael Haren Jan 08 '09 at 23:21
  • Also, if the file is opened in text mode, the native line endings will be converted to simply \n (at least on Windows; I think it's part of the C standard and thus used most everywhere) – rmeador Jan 08 '09 at 23:22
  • I figured out that I was doing something wrong with Scanner. Though your answers were still helpful in understanding linux vs. windows. Thanks guys. – jbu Jan 08 '09 at 23:14
  • For quick Debug you can open notepad++ "EDIT - > EOL Conversion - > Windows or Linux or Mac" – user1767754 Jul 23 '14 at 13:00
  • All the early RFCs like Telnet, Email, Privacy Enhanced Mail (PEM), used `'\r\n'`, too. It is not just a Windows thing. Some of the modern RFCs gave up and say you must handle any line ending. The SSH file format RFC even says you write using the native host's format, so you know there will be at least 3 cases in the wild. – jww Oct 20 '19 at 13:07
11

You can get the standard line ending for your current OS from:

System.getProperty("line.separator")
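
A small sketch of how that property might be inspected. The class name `LineSeparatorDemo` and the `escape` helper are invented for this example; `System.lineSeparator()` is the shorthand available since Java 7. Keep in mind that this value describes the platform the JVM runs on, not the file being read, so a file copied from another system may still use a different ending.

```java
public class LineSeparatorDemo {
    // Hypothetical helper: makes control characters visible when printed.
    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String fromProperty = System.getProperty("line.separator");
        String fromMethod = System.lineSeparator(); // Java 7 and later
        System.out.println("line.separator   = " + escape(fromProperty));
        System.out.println("lineSeparator()  = " + escape(fromMethod));
    }
}
```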
Chase Seibert
4

The scanner is breaking at the spaces.

EDIT: The 'Scanning' Java Tutorial states:

By default, a scanner uses white space to separate tokens. (White space characters include blanks, tabs, and line terminators. For the full list, refer to the documentation for Character.isWhitespace.)

You can use the useDelimiter() method to change these defaults.
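
To make the difference concrete, here is a minimal sketch that reproduces the three-token split from the question and then avoids it with `useDelimiter()`. The class and method names are invented for this example, and whether this matches the asker's actual bug is an assumption, since the real cause was never posted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokens {
    // The line from the question, with a Unix line ending.
    static final String LINE =
        "list:x:38:38:Mailing List Manager:/var/list:/bin/sh\n";

    // Default delimiter: next() stops at any whitespace, including blanks.
    static List<String> withDefaultDelimiter(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            while (sc.hasNext()) tokens.add(sc.next());
        }
        return tokens;
    }

    // Delimiter restricted to line terminators: next() returns whole lines.
    static List<String> withLineDelimiter(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(input).useDelimiter("\r\n|\n|\r")) {
            while (sc.hasNext()) tokens.add(sc.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(withDefaultDelimiter(LINE)); // three tokens
        System.out.println(withLineDelimiter(LINE));    // one token
    }
}
```

Calling `hasNextLine()` and `nextLine()` instead of `hasNext()` and `next()` gives the same whole-line behaviour without changing the delimiter.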

Kevin Haines
1

This works for me on Ubuntu

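import java.io.File;
import java.util.Scanner;

public class test {
  public static void main(String[] args) {
    // try-with-resources closes the Scanner (and the file) when done
    try (Scanner sc = new Scanner(new File("/etc/passwd"))) {
      // nextLine() throws NoSuchElementException at end of input instead of
      // returning null, so check hasNextLine() before each read
      while (sc.hasNextLine()) {
        String[] p = sc.nextLine().split(":");
        for (String pi : p) System.out.print(pi + "\t:\t");
        System.out.println();
      }
    } catch (Exception e) { e.printStackTrace(); }
  }
}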
nEJC
0

Why not use LineNumberReader?

If you can't do that, what does the code look like?

The only difference I can think of is that you are splitting on a bad regex, and that when you edit the file yourself you get DOS newlines that somehow pass your regex.

Still, for reading things one line at a time, it seems like overkill to use Scanner.
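
A rough sketch of what the LineNumberReader approach could look like. The class name `PasswdLines`, the `userNames` helper, and the sample records are made up for this example, and it reads from a string rather than the real file so it runs anywhere. To read an actual file, a `FileReader` could be passed to the `LineNumberReader` constructor instead.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class PasswdLines {
    // Hypothetical helper: returns "lineNumber:firstField" for each record.
    // readLine() accepts "\n", "\r" and "\r\n" as line terminators.
    static List<String> userNames(String text) {
        List<String> names = new ArrayList<>();
        try (LineNumberReader in = new LineNumberReader(new StringReader(text))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(":");
                // getLineNumber() counts the lines read so far, starting at 1
                names.add(in.getLineNumber() + ":" + fields[0]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sample in /etc/passwd format with mixed line endings.
        String sample = "root:x:0:0:root:/root:/bin/bash\r\n"
                      + "list:x:38:38:Mailing List Manager:/var/list:/bin/sh\n";
        System.out.println(userNames(sample));
    }
}
```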

Of course, why you are parsing /etc/passwd is a whole other discussion :)

davetron5000
  • I am parsing /etc/passwd to get the user's associated group name from the /etc/group file. – jbu Jan 08 '09 at 23:14