I'm trying to write a shell script that (among other things) will replace Windows line endings (^M) and vertical tabs (^K) with newlines. Sed looks like the tool to use, but I can't quite get it to work. I can't see why this fails:

$ sed -i 's/^K/\n/g' article_filemakerExport.xml 
sed: 1: "article_filemakerExport ...": command a expects \ followed by text

Note: I'm working on a mac.

mklement0
doub1ejack

3 Answers

4

With the Windows line ending, you want to remove the ^M (or \r or carriage return), but you want to replace the ^K with newline, it would seem.

The command I'd use is tr, twice.

tr -d '\r' < article_filemakerExport.xml | tr '\13' '\12' > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$

Because one operation deletes and the other substitutes, I don't think you can combine them into a single tr invocation. If you're worried about links, etc., you can use cp tmp.$$ article_filemakerExport.xml; rm -f tmp.$$ instead of mv, since cp overwrites the existing file rather than replacing it.

You could also use dos2unix to convert the CRLF to NL line endings instead of tr.

Note that tr is a pure filter; it only reads standard input and only writes to standard output. It does not read or write files directly.
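Because tr only filters standard input to standard output, an in-place edit always needs a temporary file. The following is a minimal sketch of the delete-then-translate pipeline wrapped in a function; the name fix_line_endings and the file sample.txt are made up for illustration, and the temporary file is placed next to the original so that mv stays on the same filesystem.

```shell
# Hypothetical helper: rewrite a file in place, deleting every CR
# and turning every vertical tab (octal 13) into a newline (octal 12).
fix_line_endings() {
    file=$1
    tmp="$file.tmp.$$"
    # The status of the pipeline is that of the last tr.
    if tr -d '\r' < "$file" | tr '\13' '\12' > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Demo on a throwaway file containing a vertical tab and a CRLF ending.
printf 'one\13two\r\nthree\n' > sample.txt
fix_line_endings sample.txt
od -c sample.txt    # inspect the bytes: no \r or \v should remain
```

Wrapping the pipeline this way keeps the original file untouched if either tr fails, which is the same intent as the && / || chain above.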


Actually, I need to replace both of these with a newline.

That's easier: a single invocation of tr will do the job:

tr '\13\15' '\12\12' < article_filemakerExport.xml > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$

Or, if you prefer:

tr '\13\r' '\n\n' < article_filemakerExport.xml > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$

Control-K is the vertical tab, so \v can probably be used in place of \13 (it worked for me on my Mac).
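One consequence of mapping \r to \n is worth checking against your data. The sketch below, using made-up sample strings, counts output lines for CRLF input versus CR-only input: with CRLF, each \r becomes an extra newline and the text ends up double-spaced, while with bare CR endings the mapping yields ordinary lines.

```shell
# CRLF input: each \r turns into its own \n, next to the \n already there.
crlf_lines=$(printf 'a\r\nb\r\n' | tr '\13\r' '\n\n' | wc -l | tr -d ' ')

# CR-only input (old Mac style): each \r becomes the sole line terminator.
cr_lines=$(printf 'a\rb\r' | tr '\13\r' '\n\n' | wc -l | tr -d ' ')

echo "CRLF input: $crlf_lines lines; CR-only input: $cr_lines lines"
```

If the file really has CRLF endings and single spacing is wanted, the earlier tr -d '\r' variant is the one to use.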

(Added the && and || rm -f tmp.$$ commands at the hinting of Ed Morton.)


Partial list of control characters

 C Oct Dec Hex Unicode Name
\a 07   7  07  U+0007 BELL
\b 10   8  08  U+0008 BACKSPACE
\t 11   9  09  U+0009 HORIZONTAL TABULATION
\n 12  10  0A  U+000A LINE FEED
\v 13  11  0B  U+000B VERTICAL TABULATION
\f 14  12  0C  U+000C FORM FEED
\r 15  13  0D  U+000D CARRIAGE RETURN

You can find a complete set of these control characters at the Unicode site (http://www.unicode.org/charts/PDF/U0000.pdf). No doubt there are many other possible places to look too.
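If you would rather check the octal codes locally than look them up, something like the following sketch prints the byte value of each C-style escape from the table. It assumes a printf that understands all seven escapes, which POSIX specifies.

```shell
# Print the octal byte value that each C-style escape expands to.
for esc in '\a' '\b' '\t' '\n' '\v' '\f' '\r'; do
    # printf expands the escape; od dumps the resulting byte in octal.
    code=$(printf "$esc" | od -An -to1 | tr -d ' \n')
    printf '%s = %s\n' "$esc" "$code"
done
```

The same od invocation can be pointed at a slice of the real file to see which control characters it actually contains.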

Jonathan Leffler
  • In your first invocation of tr I think you need to redirect instead of specifying the filename. With that fixed your answer is better than mine since tr is standard and dos2unix is not always available. – Peter Bowers Feb 24 '15 at 18:42
  • Thanks, @PeterBowers; yes, it was missing a `<` that I know perfectly well is needed (witness the last paragraph). – Jonathan Leffler Feb 24 '15 at 18:47
  • Actually, I need to replace both of these with a newline. – doub1ejack Feb 24 '15 at 18:48
  • doub1ejack, I think you will find that @JonathanLeffler's answer does just what you want. A DOS/Windows "newline character" is actually 2 characters, and when you DELETE the \r it leaves the UNIX newline character. Then in his second tr he changes \013 (^K in octal) to \12 (UNIX newline character). Therefore both the DOS newline and the ^K are converted to UNIX newlines. Try it - you'll like it... :-) – Peter Bowers Feb 24 '15 at 19:14
  • @EdMorton: fair enough...changing that, too. – Jonathan Leffler Feb 24 '15 at 19:19
  • Yes, that seems to work well. What character format is '\013' and where do I look up this conversion? I also need to look up whatever `^G` is... – doub1ejack Feb 24 '15 at 19:28
  • ^G is BEL, or `\a`, or `\007` (the James Bond character), or `\7` (these are octal constants, with or without the leading zeros). I'll add some notes on control-character names to the main answer. `^K` is also 'vertical tabulation', and so `\v` will probably map to that (it did for me on my Mac). – Jonathan Leffler Feb 24 '15 at 19:33
  • I think a \r\n will now become \n\n unless I'm mistaken - is that what is desired? In other words, if a ^M appears at the end of the line do you want it simply deleted (leaving 1 newline) or do you want 2 newlines (by replacing the ^M with a newline)? – Peter Bowers Feb 24 '15 at 19:38
  • @JonathanLeffler: In ascii, \a\b\t\n\v\f\r correspond to \007 - \013 in that order (or ^G - ^M, if you prefer), so \v is indeed ^K. There's probably a good mnemonic for that, but I've forgotten it. – rici Feb 24 '15 at 19:39
  • @PeterBowers: I agree that Windows CRLF line endings and mapping CR to LF would give you double-spaced text, which is probably not what is required. It's also possible that the data is actually coming in as (old-style) Mac lines with just CR line endings, in which case the `\r` to `\n` mapping is what is required. It depends on the exact details of the input data. – Jonathan Leffler Feb 24 '15 at 19:41
  • @rici: thanks for confirming what I'd dug up from other sources too. – Jonathan Leffler Feb 24 '15 at 19:42
1
dos2unix <article_filemakerExport.xml | tr '\013\015' '\n\n'
Peter Bowers
  • Why is dos2unix needed here? Looks like perhaps it is just a safer way to pipe info into `tr`? Does it do anything else to the text? – doub1ejack Feb 24 '15 at 19:04
  • I see that it replaces dos carriage returns with newlines. But only when they are at the end of a line. I've got `^M` characters sprinkled throughout lines of text... – doub1ejack Feb 24 '15 at 19:30
  • Ah, in your question you specifically mentioned Windows line endings and then put ^M in parentheses. We assumed you meant real Windows line endings, which are actually \r\n. I will edit my answer accordingly. – Peter Bowers Feb 24 '15 at 19:32
1

A BSD (OS X) sed solution, assisted by ANSI C-quoted bash strings:

sed -i "" $'s/\r$/\\\n/g; s/\v/\\\n/g' article_filemakerExport.xml

Note:

  • BSD sed - unlike GNU sed - requires an argument with the -i option; so, to indicate that no backup file should be created, an empty string ("") must be passed - see below for how that explains the error you got.
  • The command replaces \r\n with \n\n rather than \n, which is what I understand you want. To get just \n, make the replacement string of the first substitution empty (s/\r$//); to replace \r even when it is not directly followed by \n, remove the $ after \r.

Here's a proof of concept with sample input:

$ sed  $'s/\r$/\\\n/g; s/\v/\\\n/g' <<<$'one\vtwo\r\nthree\nfour'
one
two

three
four

(All line breaks in the output above are \n.)

  • An ANSI C-quoted string ($'...') is needed to compensate for the lack of support for escape sequences in BSD sed: the shell creates desired control characters ($'\v' creates a vertical tab (^K; $'\13' would work too), $'\r' the CR (^M), $'\n' the newline) and passes the resulting literals to sed.
  • \\\n results in a literal \ followed by a literal newline - BSD sed requires literal newlines in the replacement string to be \-escaped (and doesn't support the escape code \n).
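For shells without $'...' support, the same idea can be sketched by letting printf create the control characters and interpolating them into a double-quoted sed script. The function name strip_cr_split_vt is hypothetical, and note that this variant deletes a trailing \r (yielding a single \n) instead of replacing it with a newline.

```shell
# Hypothetical helper: delete a CR at end of line and turn each
# vertical tab into a newline, without relying on $'...' quoting.
strip_cr_split_vt() {
    cr=$(printf '\r')   # literal carriage return
    vt=$(printf '\v')   # literal vertical tab
    # Inside double quotes, \\ is a literal backslash; the real newline
    # after it is the backslash-escaped newline sed expects in a replacement.
    sed "s/$cr\$//; s/$vt/\\
/g"
}

printf 'one\vtwo\r\nthree\n' | strip_cr_split_vt
```

Since sed only ever sees literal characters here, the script does not depend on sed's own escape-sequence support.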

As for why your command didn't work:

Note: It looks like your problems stem at least in part from assuming that BSD sed works the same as GNU sed, which, unfortunately, is not the case: there are many subtle and not so subtle differences - see https://stackoverflow.com/a/24276470/45375

  • The missing argument for the -i option caused sed to interpret your program as the -i argument, and your filename as the program. Since your filename starts with a, sed saw the a (append text) command, and choked on the rest of the filename (because it's not a valid a command).
  • Even supplying the missing -i option-argument wouldn't have made the command work, for the reasons listed above (in short: no support for control-char. escape sequences), and also because ^K typed as a caret followed by K is not a vertical tab: sed reads ^ as the start-of-line anchor, so the pattern matches a literal K at the start of a line (in GNU sed you could have used \v directly).
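If a script has to run under both BSD and GNU sed, one common workaround for the -i difference is to attach a backup suffix directly to the option, a form both implementations accept, and then delete the backup. A minimal sketch, with demo.txt as a made-up file name:

```shell
# Create a throwaway file to edit.
printf 'hello\n' > demo.txt

# -i.bak (suffix attached, no space) is accepted by both BSD and GNU sed;
# remove the backup afterwards if it is not wanted.
sed -i.bak 's/hello/goodbye/' demo.txt && rm -f demo.txt.bak

cat demo.txt
```

This sidesteps the -i parsing difference only; the escape-sequence differences described above still apply.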
mklement0