
I'm working on a small text file with a list of words in it that I want to add a new word to, and then sort. The file doesn't have a newline at the end when I start, but does after the sort. Why? Can I avoid this behavior or is there a way to strip the newline back out?

Example:

words.txt looks like

apple
cookie
salmon

I then run printf "\norange" >> words.txt; sort words.txt -o words.txt

I use printf rather than echo figuring that'll avoid the newline, but the file then reads

apple
cookie
orange
salmon
#newline here

If I just run printf "\norange" >> words.txt, orange appears at the bottom of the file with no newline, i.e.:

apple
cookie
salmon
orange
Alex
  • 1
    `sort` figures it is doing you a favor. Mine always reports `sort: warning: newline appended` Various versions of `sort` have different features. Comb the man pages of your available versions, maybe you'll find a cmd-arg `--no-newline` or similar. In the future please post such non-programming related Qs (IMHO) to https://unix.stackexchange.com or https://superuser.com . Good luck. – shellter Jan 08 '18 at 16:23
  • 1
    A "text file" without a trailing newline is not a valid UNIX text file. Many tools will outright ignore any line without a trailing newline -- any [BashFAQ #1](http://mywiki.wooledge.org/BashFAQ/001) `while read` loop, for instance, will exit on such lines rather than processing them. – Charles Duffy Jan 08 '18 at 16:25
  • 1
    BTW, a single trailing newline doesn't create a blank line (as you've rendered here) -- it just ensures that the line before it is complete, ie. not leaving the cursor hanging waiting for more content, or leaving a programmatic reader unclear as to whether the file was fully flushed. – Charles Duffy Jan 08 '18 at 16:35

2 Answers


This behavior is explicitly defined in the POSIX specification for sort:

The input files shall be text files, except that the sort utility shall add a newline to the end of a file ending with an incomplete last line.

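This is because a UNIX "text file" is only valid if every line ends in a newline. POSIX defines a line as "a sequence of zero or more non-<newline> characters plus a terminating <newline> character", and a text file as: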

Text file - A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the newline character. Although POSIX.1-2008 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.
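To see whether a given file meets that definition, you can look at its last byte. This is only a minimal sketch: `ends_with_newline` is a hypothetical helper name, and the two sample files are made up for illustration.

```shell
# Hypothetical helper: succeeds if the file is empty or its last byte is a newline.
ends_with_newline() {
    # Command substitution strips a trailing newline, so the captured text is
    # empty exactly when the last byte is a newline (or the file is empty).
    [ -z "$(tail -c 1 -- "$1")" ]
}

tmpdir=$(mktemp -d)
printf 'apple\ncookie\nsalmon'   > "$tmpdir/incomplete.txt"   # no final newline
printf 'apple\ncookie\nsalmon\n' > "$tmpdir/complete.txt"     # proper text file

for f in "$tmpdir/incomplete.txt" "$tmpdir/complete.txt"; do
    if ends_with_newline "$f"; then
        echo "${f##*/}: last line is complete"
    else
        echo "${f##*/}: incomplete last line"
    fi
done
rm -rf "$tmpdir"
```

The first file is reported as having an incomplete last line, the second as complete.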

Charles Duffy

Think about what you are asking sort to do.

You are asking it "take all the lines, and sort them in order."

You've given it a file containing four lines, which it splits into the following strings:

"apple\n"
"cookie\n"
"salmon\n"
"orange"

It sorts these for you dutifully:

"apple\n"
"cookie\n"
"orange"
"salmon\n"

And it then outputs them as a single string:

"apple
cookie
orangesalmon
"

That is almost certainly exactly what you do not want.

So instead, if your file is missing the terminating newline that it should have had, the sort program understands that, most likely, you still intended that last line to be a line, rather than just a fragment of a line. It appends a \n to the string "orange", making it "orange\n". Then it can be sorted properly, without "orange" getting concatenated with whatever line happens to come immediately after it:

"apple\n"
"cookie\n"
"orange\n"
"salmon\n"

So when it then outputs them as a single string, it looks a lot better:

"apple
cookie
orange
salmon
"
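You can watch this happen by checking the file's last byte before and after the sort. A rough sketch, using a throwaway copy of the question's words.txt in a temporary directory (the variable names are just for illustration):

```shell
tmpdir=$(mktemp -d)
words="$tmpdir/words.txt"

# Three lines with the last one unterminated, then the question's append.
printf 'apple\ncookie\nsalmon' > "$words"
printf '\norange' >> "$words"

# Final byte before sorting: the "e" of "orange".
before=$(tail -c 1 "$words" | od -An -c | tr -d ' ')

sort -o "$words" "$words"

# Final byte after sorting: od -c renders a newline as \n.
after=$(tail -c 1 "$words" | od -An -c | tr -d ' ')

echo "last byte before sort: $before"
echo "last byte after sort:  $after"
rm -rf "$tmpdir"
```

The file goes in ending with "e" and comes out ending with a newline, which is the incomplete last line being completed.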

You could strip the last character off the file, the one from the end of "salmon\n", using a range of handy tools such as awk, sed, perl, php, or even raw bash. This is covered elsewhere, in places like:

How can I remove the last character of a file in unix?
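For completeness, one way to do it without any extra tools is to lean on the fact that command substitution drops trailing newlines. This is only a sketch on a temporary copy of the file, and it holds the whole sorted output in a shell variable, so it assumes a small file:

```shell
tmpdir=$(mktemp -d)
words="$tmpdir/words.txt"
printf 'apple\ncookie\nsalmon\norange' > "$words"

# $(...) removes every trailing newline from sort's output.
sorted=$(sort "$words")

# printf '%s' writes the text back without adding a newline of its own.
printf '%s' "$sorted" > "$words"

last=$(tail -c 1 "$words" | od -An -c | tr -d ' ')
echo "last byte: $last"
rm -rf "$tmpdir"
```

The file now ends with the "n" of "salmon" rather than a newline.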

But please don't do that. You'll just cause problems for all other utilities that have to handle your files, like sort. And if you assume that there is no terminating newline in your files, then you will make your code brittle: any part of the toolchain which "fixes" your error (as sort kinda does here) will "break" your code.

Instead, treat text files the way they are meant to be treated in unix: a sequence of "lines" (strings of zero or more non-newline bytes), each followed by a newline.

So newlines are line-terminators, not line-separators.
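Here is a small sketch of what goes wrong when other tools meet an unterminated last line. The file contents match the question; the variable names are made up for the example.

```shell
tmpdir=$(mktemp -d)
printf 'apple\ncookie\nsalmon\norange' > "$tmpdir/words.txt"   # last line unterminated

# wc -l counts newline characters, so the unterminated "orange" is not counted.
newline_count=$(wc -l < "$tmpdir/words.txt" | tr -d ' ')

# read returns non-zero when it reaches end-of-file before a newline,
# so the loop body never runs for "orange".
seen=""
while IFS= read -r line; do
    seen="$seen$line,"
done < "$tmpdir/words.txt"

echo "wc -l reports: $newline_count"
echo "loop processed: $seen"
rm -rf "$tmpdir"
```

wc reports 3 lines for a file with four words in it, and the loop only ever sees apple, cookie and salmon.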

There is a coding style where prints and echos are done with the newline leading. This is wrong for many reasons, including creating malformed text files, and causing the output of the program to be concatenated with the command prompt. printf "orange\n" is correct style, and also more readable: at a glance someone maintaining your code can tell you're printing the word "orange" and a newline, whereas printf "\norange" looks at first glance like it's printing a backslash and the phrase "no range" with a missing space.
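Putting that together for the file in the question, a sketch of the whole workflow might look like the following. It assumes the starting file is missing its final newline, as in the question, so it runs sort once up front to complete the last line. POSIX allows the `-o` output file to be one of the input files.

```shell
tmpdir=$(mktemp -d)
words="$tmpdir/words.txt"
printf 'apple\ncookie\nsalmon' > "$words"   # the question's file, no final newline

# One-off repair: sort completes the last line, so later appends are safe.
sort -o "$words" "$words"

# Append the new word with its terminating newline, then sort in place.
printf 'orange\n' >> "$words"
sort -o "$words" "$words"

cat "$words"
result=$(cat "$words")
last=$(tail -c 1 "$words" | od -An -c | tr -d ' ')
rm -rf "$tmpdir"
```

Once the file is well-formed, only the last two commands are needed each time you add a word.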

Dewi Morgan