1

I have a file, text.txt, which contains the following text:

 Kontagent Announces Partnership with Global Latino Social Network Quepasa

 Released By Kontagent

I read this text file into a string documentText.

documentText.substring(0,9) gives Kontagent, which is correct.

But documentText.substring(87,96) gives "y Kontage" on Windows (IntelliJ IDEA) and "Kontagent" on Unix. I am guessing this is caused by the blank line in the file (after which the offsets are thrown off), but I cannot understand why I get two different results. I need the same result in both environments.

To read the file as a string, I tried all the functions discussed in "How do I create a Java string from the contents of a file?", but I still get the same results with every one of them.

Currently I am using this function to read the file into documentText String:

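import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public static String readFileAsString(String fileName)
{
    File file = new File(fileName);
    StringBuilder fileContents = new StringBuilder((int) file.length());
    Scanner scanner;
    try {
        scanner = new Scanner(file);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        return null; // nothing to read; avoids a NullPointerException below
    }
    // Platform-dependent value: "\r\n" on Windows, "\n" on Unix
    String lineSeparator = System.getProperty("line.separator");

    try {
        while (scanner.hasNextLine()) {
            // nextLine() returns the line without its terminator,
            // so the platform separator is appended in its place
            fileContents.append(scanner.nextLine()).append(lineSeparator);
        }
        return fileContents.toString();
    } finally {
        scanner.close();
    }
}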

EDIT: Is there a way to write a general function that works in both Windows and Unix environments, even if the file was copied in text mode? Unfortunately, I cannot guarantee that everyone working on this project will always copy files in binary mode.

Watt
  • How about trimming all the spaces, so you don't have to worry about spacing? – roymustang86 Jul 13 '12 at 17:10
  • I cannot trim spaces. The char offsets need to be preserved as they are. If I trim spaces in my code, I will have to make sure that the third-party components which have worked, and will work, on this file do the same. The offset information is communicated to me through a different file. (Hope that was not too vague an explanation :) ) – Watt Jul 13 '12 at 17:18

3 Answers

2

On Windows, the newline character \n is preceded by a carriage return character, \r. Linux uses \n alone. Transferring the file from one operating system to the other will not by itself strip or add these characters, but text editors will occasionally convert them for you.

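Because your file does not include \r characters (presumably it came straight from Linux), while System.getProperty("line.separator") returns \r\n on Windows, your function appends a \r after each line that the file never contained. With two line breaks before the substring, the string built on Windows is 2 characters longer at that point, which is why your output starts 2 characters early.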
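To make the arithmetic concrete, here is a minimal sketch that rebuilds the sample text from the question with each separator. The class name is hypothetical, and it assumes the file holds exactly the two lines shown, separated by one blank line:

```java
public class LineSeparatorOffsetDemo {
    static final String LINE_1 =
            "Kontagent Announces Partnership with Global Latino Social Network Quepasa";
    static final String LINE_3 = "Released By Kontagent";

    // Joins the sample lines (with one blank line between them) using the given separator.
    public static String build(String separator) {
        return LINE_1 + separator + separator + LINE_3 + separator;
    }

    public static void main(String[] args) {
        String unixStyle = build("\n");       // what the question's function produces on Unix
        String windowsStyle = build("\r\n");  // what it produces on Windows

        System.out.println(unixStyle.substring(87, 96));    // Kontagent
        System.out.println(windowsStyle.substring(87, 96)); // y Kontage
    }
}
```

The first line is 73 characters long, so with \n the second "Kontagent" starts at 73 + 2 + 12 = 87, and with \r\n it starts at 73 + 4 + 12 = 89.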

Good luck!

Daniel Li
  • Understood. But won't String lineSeparator = System.getProperty("line.separator"); in the function already take care of that? – Watt Jul 13 '12 at 17:11
  • Yes, but if the file has `\n` instead of `\r\n`, it will assume your line separator is `\r\n` and take off two characters instead. This is why your output is 2 characters early (two `\r` values that shouldn't have been counted) – Daniel Li Jul 13 '12 at 17:12
  • You should account for the length of the separator. So when taking the substring, take into account the number of newlines that fall before it. "hello\nWorld" (Unix) has a different length from "hello\r\nWorld" (Windows). – DZittersteyn Jul 13 '12 at 17:15
  • +1, thanks for explaining about \n and \r. Looks like you nailed the problem. I will be back shortly and accept the answer if it worked. – Watt Jul 13 '12 at 17:36
  • @DZittersteyn that's a great idea too. – Watt Jul 13 '12 at 17:41
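
The idea from the comments of counting the line breaks that fall before the substring could be sketched roughly as follows. The class and method names are hypothetical, and it assumes the offsets were computed against \n-only text while the string in hand uses \r\n:

```java
public class OffsetTranslator {
    // Maps an offset in "\n"-terminated text to the matching offset in the
    // "\r\n" version of the same text: each line break before the offset
    // contributes one extra '\r'.
    public static int toCrlfOffset(String lfText, int lfOffset) {
        int newlinesBefore = 0;
        for (int i = 0; i < lfOffset; i++) {
            if (lfText.charAt(i) == '\n') {
                newlinesBefore++;
            }
        }
        return lfOffset + newlinesBefore;
    }

    public static void main(String[] args) {
        String lf = "hello\nWorld";
        String crlf = "hello\r\nWorld";
        int start = toCrlfOffset(lf, 6); // 'W' is at index 6 in the "\n" text
        System.out.println(crlf.substring(start)); // World
    }
}
```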
2

The Unix file probably uses the native Unix EOL char: \n, whereas the Windows file uses the native Windows EOL sequence: \r\n. Since you have two EOLs in your file, there is a difference of 2 chars. Make sure to use a binary file transfer, and all the bytes will be preserved, and everything will run the same way on both OSes.

EDIT: in fact, you are the one who appends an OS-specific EOL (System.getProperty("line.separator")) to the end of each line. Just read the file as a char array using a Reader, and everything will be fine. Or use Guava's method, which does it for you:

String s = CharStreams.toString(new FileReader(fileName)); 
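
If Guava is not available, a plain-JDK sketch of the same idea (Java 7 or later) might look like the following. The class and method names are hypothetical, and UTF-8 is an assumption about the file's encoding:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class VerbatimFileReader {
    // Returns the file's characters exactly as stored, line terminators included.
    // Assumes the file is UTF-8 encoded.
    public static String readVerbatim(String fileName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small file with Windows line endings, then read it back.
        Path sample = Files.createTempFile("sample", ".txt");
        Files.write(sample, "one\r\n\r\ntwo".getBytes(StandardCharsets.UTF_8));

        String text = readVerbatim(sample.toString());
        System.out.println(text.length()); // 10: both \r\n pairs are preserved
        Files.delete(sample);
    }
}
```

Because nothing is appended or stripped, the resulting offsets depend only on the bytes in the file, not on the operating system running the code.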
JB Nizet
  • +1, thanks for the information; I didn't know about this one-liner Guava method. It looks like, as "Hope I helped" pointed out, the problem is with the file transfer itself. I will get back here once I have it working. – Watt Jul 13 '12 at 17:35
  • The problem is not in the file transfer. I thought it was before you added the code. But the problem is in your code: you append \r\n to each line on Windows, and \n on Unix, because that's what System.getProperty("line.separator") returns. – JB Nizet Jul 13 '12 at 18:17
  • For some reason, if I use String s = CharStreams.toString(new FileReader(fileName)); to read the file, I still get the same erroneous documentText.substring() result. BTW, is there a way to attach my txt file to this question? Thanks! – Watt Jul 13 '12 at 18:23
  • Which error? You just need to know which kind of EOL your file uses. If you use the same file on both OSes, it should lead to the same result. – JB Nizet Jul 13 '12 at 18:24
  • The error: documentText.substring(87,96) gives y Kontage. By inspecting documentText, I figured out that the file has \r\n as EOL – Watt Jul 13 '12 at 18:31
  • Thanks, you were very helpful. Now, I know the reason why it is happening. I edited the question. Please see last paragraph in the question. – Watt Jul 13 '12 at 18:50
0

Based on the input you guys provided, I wrote something like this:

documentText = CharStreams.toString(new FileReader("text.txt"));
documentText = documentText.replaceAll("\\r", "");

to strip the extra \r characters if the file has any.

Now I am getting the expected result on Windows as well as Unix. Problem solved!!!

It works regardless of the mode in which the file was copied.
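
A slightly narrower variation, as a sketch only: replace each \r\n pair with \n instead of removing every \r, so that a stray \r elsewhere in the text is left alone. The class and method names are hypothetical, and it assumes the offsets handed over by the other components were computed against \n-only text:

```java
public class LineEndingNormalizer {
    // Converts Windows line endings (\r\n) to Unix ones (\n).
    // A lone \r that is not followed by \n is left untouched.
    public static String normalize(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String line1 =
                "Kontagent Announces Partnership with Global Latino Social Network Quepasa";
        String windowsCopy = line1 + "\r\n\r\n" + "Released By Kontagent\r\n";
        String unixCopy = line1 + "\n\n" + "Released By Kontagent\n";

        // After normalizing, both copies give the same substring.
        System.out.println(normalize(windowsCopy).substring(87, 96)); // Kontagent
        System.out.println(normalize(unixCopy).substring(87, 96));    // Kontagent
    }
}
```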

:) I wish I could accept both of your answers, but Stack Overflow doesn't allow it.

Watt