
I wrote a simple parser in Java that reads a file one character at a time and constructs words.

When I ran it under Linux, I noticed that looking for '\n' doesn't work, although comparing the character with the value 10 works as expected. According to the ASCII table, value 10 is LF (line feed). I read somewhere (I don't remember where) that Java should be able to find a newline just by looking for '\n'.

I am using BufferedReader and the read method to read characters.

EDIT

readLine cannot be used because it would cause other problems in my code.

It looks like the problem appears when I use files with Mac/Windows line endings under Linux.

kechap
  • Please show actual code. – unwind Jan 02 '12 at 12:05
  • See [`line.separator`](http://docs.oracle.com/javase/tutorial/essential/environment/sysprop.html). – trashgod Jan 02 '12 at 12:06
  • possible duplicate of [Java: How do I get a platform independent new line character?](http://stackoverflow.com/questions/207947/java-how-do-i-get-a-platform-independent-new-line-character) – trashgod Jan 02 '12 at 12:06
  • It's most likely that you are doing something wrong. Perhaps you are using readLine() and scanning the line? – Peter Lawrey Jan 02 '12 at 12:08
  • @trashgod I tried it but it has the same result. – kechap Jan 02 '12 at 12:09
  • Normally this should simply work, according to my test here: http://ideone.com/ntk4b So you need to provide more code and look for the problem somewhere else. – Martijn Courteaux Jan 02 '12 at 12:09
  • The `readLine()` method of [`BufferedReader`](http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html) should work. Please provide an [sscce](http://sscce.org/) that exhibits the problem you describe. – trashgod Jan 02 '12 at 12:18

3 Answers


Use readLine() to read the text line by line.

Example:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

try {
    FileInputStream fstream = new FileInputStream("textfile.txt");
    // A DataInputStream is not needed for text; wrap the stream in a reader directly
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    String strLine;
    // Read the file line by line; readLine() strips the line terminator
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        System.out.println(strLine);
    }
    // Close the reader, which also closes the underlying stream
    br.close();
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
trashgod
Sunil Kumar Sahoo

If you read files character by character, you have to handle all three cases yourself: '\n' for Linux, "\r\n" for Windows, and '\r' for classic Mac OS.
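
The three cases above could be handled roughly as follows. This is only a sketch, not the asker's code: the class name `LineBreakScanner` and the method `readLines` are made up for illustration, and it assumes a `'\n'` that directly follows a `'\r'` should count as part of the same line break.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question or answers.
public class LineBreakScanner {

    // Splits the input into lines, treating "\n", "\r\n" and "\r" each as one line break.
    static List<String> readLines(Reader reader) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean lastWasCr = false;
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == '\r') {
                    // '\r' always ends a line (classic Mac, or first half of "\r\n")
                    lines.add(current.toString());
                    current.setLength(0);
                    lastWasCr = true;
                } else if (c == '\n') {
                    // a '\n' right after '\r' belongs to the same "\r\n" pair
                    if (!lastWasCr) {
                        lines.add(current.toString());
                        current.setLength(0);
                    }
                    lastWasCr = false;
                } else {
                    current.append((char) c);
                    lastWasCr = false;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // keep a last line that has no terminator
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String mixed = "one\ntwo\r\nthree\rfour";
        System.out.println(readLines(new StringReader(mixed)));
    }
}
```

Comparing against `'\n'` and `'\r'` is the same as comparing against 10 and 13, so a file with Windows or classic Mac line endings needs the extra `'\r'` branch no matter which platform the program runs on.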

Use the method readLine instead. It takes care of these things for you and returns only the line without any terminators. After reading each line you can tokenize it to get the single words.

Also consider using the system property "line.separator". It always holds the system-dependent line terminator, which makes at least your code (not the produced files) more portable.
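
A small sketch of reading that property follows. The class name `LineSeparatorDemo` and the `escape` helper are made up for illustration. Note that the value describes the platform the JVM runs on, not the file being read, so it does not help when a Windows file is parsed under Linux.

```java
// Hypothetical demo class, not part of the original answer.
public class LineSeparatorDemo {

    // Returns a printable form of a line terminator, e.g. the two characters "\r\n"
    // become the four visible characters \r\n.
    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // The system property named in the answer
        String fromProperty = System.getProperty("line.separator");
        // Convenience method available since Java 7
        String fromMethod = System.lineSeparator();

        System.out.println("line.separator     = " + escape(fromProperty));
        System.out.println("System.lineSeparator() = " + escape(fromMethod));
    }
}
```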

A4L
  • Mac OS X uses `\n`; Mac OS 9 and earlier used `\r`. – trashgod Jan 02 '12 at 12:15
  • I think this `'\r'` is what creates the problem, assuming it has decimal value `13`. – kechap Jan 02 '12 at 12:15
  • good to know mac-guys came away from `'\r'` ... @marcus - you can also use the Character static method [isWhitespace(char ch)](http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isWhitespace%28char%29) to throw away everything that is whitespace. Loop (read chars into a StringBuilder until you encounter whitespace, construct a word, keep reading while still whitespace) until no chars are left to read – A4L Jan 02 '12 at 12:27
  • @A4L I wish it was that simple. Words are not always separated with a whitespace. – kechap Jan 02 '12 at 12:31
  • what are the separator chars in your usecase? – A4L Jan 02 '12 at 12:38
  • @A4L whitespace , `;`, `,`, `(`, `)`,`!`,`'`,`@`, `~` – kechap Jan 02 '12 at 12:43

Here are two ways you can do it:

1- read line by line and split each line using a regular expression to get the single words

2- write your own isDelimiter method and use it to check whether you have reached a split condition or not

package misctests;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import java.util.ArrayList;
import java.util.List;
import org.junit.Test;


public class SplitToWords {

    String someWords = "Lorem ipsum\r\n(dolor@sit)amet,\nconsetetur!\rsadipscing'elitr;sed~diam";
    String delimsRegEx = "[\\s;,\\(\\)!'@~]+";
    String delimsPlain = ";,()!'@~"; // without whitespaces

    String[] expectedWords = {
        "Lorem",
        "ipsum",
        "dolor",
        "sit",
        "amet",
        "consetetur",
        "sadipscing",
        "elitr",
        "sed",
        "diam"
    };

    private static final class StringReader {
        String input = null;
        int pos = 0;
        int len = 0;
        StringReader(String input) {
            this.input = input == null ? "" : input;
            len = this.input.length();
        }

        public boolean hasMoreChars() {
            return pos < len;
        }

        public int read() {
            return hasMoreChars() ? ((int) input.charAt(pos++)) : 0;
        }
    }

    @Test
    public void splitToWords_1() {
        String[] actual = someWords.split(delimsRegEx);
        assertEqualsWords(expectedWords, actual);
    }

    @Test
    public void splitToWords_2() {
        StringReader sr = new StringReader(someWords);
        List<String> words = new ArrayList<String>();
        StringBuilder sb = null;
        int c = 0;
        while(sr.hasMoreChars()) {
            c = sr.read();
            while(sr.hasMoreChars() && isDelimiter(c)) {
                c = sr.read();
            }
            sb = new StringBuilder();
            while(sr.hasMoreChars() && ! isDelimiter(c)) {
                sb.append((char)c);
                c = sr.read();
            }
            if(! isDelimiter(c)) {
                sb.append((char)c);
            }
            if (sb.length() > 0) { // skip the empty token produced when the input ends with delimiters
                words.add(sb.toString());
            }
        }

        String[] actual = new String[words.size()];
        words.toArray(actual);

        assertEqualsWords(expectedWords, actual);
    }

    private boolean isDelimiter(int c) {
        // indexOf(int) avoids building a temporary String for every character
        return Character.isWhitespace(c) || delimsPlain.indexOf(c) >= 0;
    }

    private void assertEqualsWords(String[] expected, String[] actual) {
        assertNotNull(expected);
        assertNotNull(actual);
        assertEquals(expected.length, actual.length);
        for(int i = 0; i < expected.length; i++) {
            assertEquals(expected[i], actual[i]);
        }
    }
}
A4L
  • I will try to implement that. It will affect a lot of code but that's my fault. – kechap Jan 02 '12 at 19:35
  • all you need is the outer while in `splitToWords_2()` ... which you might already have, since you read from the buffered reader byte by byte. The `StringReader` class is just a kind of mock/substitute for your buffered reader ... note that its read method returns int, just like the one of BufferedReader. For delimsPlain you could use a `java.util.Set` and initialize it in a static block, so you can go in isDelimiter with something like `... || delimsPlain.contains((char)c)`. good luck! – A4L Jan 02 '12 at 20:05
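
The suggestion in the last comment might look roughly like this when applied to a real `BufferedReader`. It is a sketch under assumptions: the class name `WordReader`, the constant `DELIMS`, and the method `readWords` are made up for illustration, and the delimiter set is the one listed in the comments above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class, not part of the original answer.
public class WordReader {

    // Non-whitespace delimiters, filled once in a static block
    private static final Set<Character> DELIMS = new HashSet<>();
    static {
        for (char ch : ";,()!'@~".toCharArray()) {
            DELIMS.add(ch);
        }
    }

    // Whitespace covers ' ', '\t', '\n' and '\r', so every line-ending style is a delimiter
    static boolean isDelimiter(int c) {
        return Character.isWhitespace(c) || DELIMS.contains((char) c);
    }

    // Reads the source one character at a time and collects the non-empty words
    static List<String> readWords(Reader source) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            int c;
            while ((c = br.read()) != -1) {
                if (isDelimiter(c)) {
                    if (word.length() > 0) {
                        words.add(word.toString());
                        word.setLength(0);
                    }
                } else {
                    word.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // flush the last word if the input does not end with a delimiter
        if (word.length() > 0) {
            words.add(word.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum\r\n(dolor@sit)amet,\nconsetetur!\rsadipscing'elitr;sed~diam";
        System.out.println(readWords(new StringReader(text)));
    }
}
```

Because `'\r'` and `'\n'` are both treated as ordinary delimiters here, the parser never has to know which platform produced the file.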