How to preserve correct offset of string which is read from a file

Question

I have a text.txt file which contains the following text:

 Kontagent Announces Partnership with Global Latino Social Network Quepasa

 Released By Kontagent

I read this text file into a String, documentText.

documentText.substring(0,9) gives Kontagent, which is good.

But documentText.substring(87,96) gives "y Kontage" on Windows (IntelliJ IDEA) and "Kontagent" in a Unix environment. I am guessing it happens because of the blank line in the file (after which the offset gets thrown off), but I cannot understand why I get two different results. I need to get the same result in both environments.

To read the file as a string I tried all the functions discussed in "How do I create a Java string from the contents of a file?", but I still get the same results with any of them.

Currently I am using this function to read the file into documentText String:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public static String readFileAsString(String fileName) throws FileNotFoundException
{
    File file = new File(fileName);
    StringBuilder fileContents = new StringBuilder((int) file.length());
    // Scanner.nextLine() drops the original line terminator;
    // the platform's own separator is appended in its place.
    String lineSeparator = System.getProperty("line.separator");

    // A missing file now propagates as an exception instead of leaving scanner null.
    Scanner scanner = new Scanner(file);
    try {
        while (scanner.hasNextLine()) {
            fileContents.append(scanner.nextLine()).append(lineSeparator);
        }
        return fileContents.toString();
    } finally {
        scanner.close();
    }
}

EDIT: Is there a way to write a general function that will work in both Windows and Unix environments, even if the file is copied in text mode? Unfortunately, I cannot guarantee that everyone working on this project will always copy files in binary mode.

How about trimming all the spaces, so you don't have to worry about spacing? – roymustang86 Jul 13 '12 at 17:10
I cannot trim spaces. The char offsets need to be preserved as they are. If I trim spaces in my code, I will have to make sure that the third-party components that have worked, and will work, on this file do the same. The offset information is communicated to me through a different file. (I hope that was not too vague an explanation :) ) – Watt Jul 13 '12 at 17:18

score 2 · Answer 1 · answered Jul 13 '12 at 17:10

On Windows, a newline character \n is preceded by \r, a carriage return character. Linux uses \n alone. Transferring the file from one operating system to the other will not strip or add such characters, but occasionally text editors will convert them for you.

Because your file does not include \r characters (presumably it was transferred straight from Linux), while System.getProperty("line.separator") returns \r\n on Windows, your code adds a \r that is not in the file at the end of every line. This is why your output on Windows is 2 characters behind.

Good luck!

answered Jul 13 '12 at 17:10

Daniel Li


Understood. But, String lineSeparator = System.getProperty("line.separator"); in the function will take care of that already? – Watt Jul 13 '12 at 17:11
Yes, but if the file has `\n` instead of `\r\n`, it will assume your line separator is `\r\n` and take off two characters instead. This is why your output is 2 characters early (two `\r` values that shouldn't have been counted) – Daniel Li Jul 13 '12 at 17:12
You should account for the length of the separator. So when getting the substring, take into account the number of newlines that fall before the substring. "hello\nWorld" (Unix) has a different length from "hello\r\nWorld" (Windows) – DZittersteyn Jul 13 '12 at 17:15
+1, thanks for explaining about \n and \r. Looks like you nailed the problem. I will be back shortly and accept the answer if it worked. – Watt Jul 13 '12 at 17:36
@DZittersteyn that's a great idea too. – Watt Jul 13 '12 at 17:41
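
The idea from DZittersteyn's comment (count the line breaks that fall before the substring) could be sketched roughly as follows. This is only an illustration, assuming the offsets in the other file were computed against a version of the text that uses \n alone; OffsetTranslator, toActualIndex and substringByLfOffsets are names invented for this example, not part of any library.

```java
public class OffsetTranslator {

    // Maps an offset that is valid for the LF-only version of the text to the
    // matching index in `text`, which may use CRLF line endings.
    // Each "\r\n" pair is counted as a single character.
    static int toActualIndex(String text, int lfOffset) {
        int logical = 0; // position in the LF-only view of the text
        int i = 0;       // position in the actual text
        while (i < text.length() && logical < lfOffset) {
            boolean crBeforeLf = text.charAt(i) == '\r'
                    && i + 1 < text.length()
                    && text.charAt(i + 1) == '\n';
            if (crBeforeLf) {
                i++; // skip the \r; the following \n is counted on the next pass
                continue;
            }
            i++;
            logical++;
        }
        return i;
    }

    // substring() driven by LF-based offsets, whatever EOL the text really uses.
    static String substringByLfOffsets(String text, int start, int end) {
        return text.substring(toActualIndex(text, start), toActualIndex(text, end));
    }

    public static void main(String[] args) {
        String line1 = "Kontagent Announces Partnership with Global Latino Social Network Quepasa";
        String line2 = "Released By Kontagent";
        String unix = line1 + "\n\n" + line2;
        String windows = line1 + "\r\n\r\n" + line2;

        System.out.println(substringByLfOffsets(unix, 87, 96));    // Kontagent
        System.out.println(substringByLfOffsets(windows, 87, 96)); // Kontagent
    }
}
```

Each lookup scans the text from the start, so for many offsets it would probably be better to precompute the positions of the line breaks once.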

JB Nizet · Accepted Answer · 2012-07-13T17:21:02.937

The Unix file probably uses the native Unix EOL char: \n, whereas the Windows file uses the native Windows EOL sequence: \r\n. Since you have two EOLs in your file, there is a difference of 2 chars. Make sure to use a binary file transfer, and all the bytes will be preserved, and everything will run the same way on both OSes.
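
The 2-character difference can be reproduced without any file at all. Below is a minimal sketch, assuming the document consists of exactly the two lines quoted in the question with one blank line between them; OffsetShiftDemo and build are names made up for this illustration.

```java
public class OffsetShiftDemo {

    static final String LINE1 =
            "Kontagent Announces Partnership with Global Latino Social Network Quepasa";
    static final String LINE2 = "Released By Kontagent";

    // Builds the document with one blank line between the two lines,
    // using the given end-of-line sequence.
    static String build(String eol) {
        return LINE1 + eol + eol + LINE2;
    }

    public static void main(String[] args) {
        String unix = build("\n");
        String windows = build("\r\n");

        System.out.println(unix.substring(87, 96));    // Kontagent
        System.out.println(windows.substring(87, 96)); // y Kontage
        // Two EOLs precede the second line, each one char longer with \r\n:
        System.out.println(windows.substring(89, 98)); // Kontagent
    }
}
```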

EDIT: in fact, you are the one who appends an OS-specific EOL (System.getProperty("line.separator")) at the end of each line. Just read the file as a char array using a Reader, and everything will be fine. Or use Guava's method, which does it for you:

String s = CharStreams.toString(new FileReader(fileName));
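
If Guava is not available, the "char array using a Reader" route might look something like the sketch below, which uses only the JDK. RawFileReader is a made-up class name, and passing an explicit charset (UTF-8 in the demo) is an assumption of this example; the original question does not say which encoding the file uses. Whatever line endings the file contains end up in the string unchanged.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RawFileReader {

    // Reads the whole file into a String without touching its line endings.
    static String readFileAsString(String fileName, Charset charset) throws IOException {
        StringBuilder contents = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = new InputStreamReader(new FileInputStream(fileName), charset)) {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                contents.append(buffer, 0, read);
            }
        }
        return contents.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small temp file with mixed EOLs, then read it back.
        String original = "a\r\n\r\nb\nc";
        Path temp = Files.createTempFile("raw-reader-demo", ".txt");
        try {
            Files.write(temp, original.getBytes(StandardCharsets.UTF_8));
            String readBack = readFileAsString(temp.toString(), StandardCharsets.UTF_8);
            System.out.println(readBack.equals(original)); // true
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```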

edited Jul 13 '12 at 17:21

answered Jul 13 '12 at 17:11


+1, thanks for the information, I didn't know about this one-liner Guava method. It looks like, as "Hope I helped" pointed out, the problem is with the file transfer itself. I will get back here once I have got it working. – Watt Jul 13 '12 at 17:35
The problem is not in the file transfer. I thought it was before you added the code. But the problem is in your code: you append \r\n to each line on Windows, and \n on Unix, because that's what System.getProperty("line.separator") returns. – JB Nizet Jul 13 '12 at 18:17
For some reason, if I use String s = CharStreams.toString(new FileReader(fileName)); to read the file, I still get the same wrong documentText.substring() result. BTW, is there a way to attach my txt file to this question? Thanks! – Watt Jul 13 '12 at 18:23
Which error? You just need to know which kind of EOL your file uses. If you use the same file on both OSes, it should lead to the same result. – JB Nizet Jul 13 '12 at 18:24
The error: documentText.substring(87,96) gives y Kontage. I figured out by inspecting documentText that the file has \r\n as EOL – Watt Jul 13 '12 at 18:31
Thanks, you were very helpful. Now, I know the reason why it is happening. I edited the question. Please see last paragraph in the question. – Watt Jul 13 '12 at 18:50

Watt · Answer 3 · 2012-07-13T19:25:46.070

Based on the input you provided, I wrote something like this:

documentText = CharStreams.toString(new FileReader("text.txt"));
documentText = documentText.replaceAll("\\r", "");

to strip off the extra \r if the file has any.

Now I am getting the expected result on Windows as well as on Unix. Problem solved!!!

It works fine irrespective of the mode in which the file has been copied.
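
A slightly narrower variant of the same idea, sketched here only as an illustration: collapse just the \r\n pairs into \n rather than deleting every \r, so a stray \r in the middle of a line would be left alone. EolNormalizer and normalizeToLf are hypothetical names, and the sketch assumes the offsets in the other file were computed against \n-only text.

```java
public class EolNormalizer {

    // Converts Windows line endings to Unix ones; a lone \r is left untouched.
    static String normalizeToLf(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String line1 = "Kontagent Announces Partnership with Global Latino Social Network Quepasa";
        String line2 = "Released By Kontagent";
        String windows = line1 + "\r\n\r\n" + line2;

        String normalized = normalizeToLf(windows);
        System.out.println(normalized.substring(87, 96)); // Kontagent
    }
}
```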

:) I wish I could accept both of your answers, but Stack Overflow doesn't allow it.
