
I've been able to feed a PHP function with a list of URLs (on a Raspberry Pi 3) only if the "list" is a txt file containing a single line (one URL) without the trailing end-of-line sign ("$"). I've tried

sed -e 's/\r$//g'

and

sed -e 's/^M//g'

but I was only able to delete the ending "$" manually in a text editor, by going to the last (i.e. second) line of the file and pressing backspace.

There's no problem splitting the master file containing hundreds of URLs into single-line files and calling the PHP function one file at a time, but there must be an easier way (sed, awk?) to delete the ending "$" at the end of the (only) line in the file.

tausun
d_kpr
  • Can you modify the PHP code to work with this? It would reduce the number of stages in processing the data. – Nigel Ren Feb 25 '20 at 07:54
  • The `$` "sign" is a zero-width assertion that does not consume any chars. So, you cannot expect `\r$` to ever match `CRLF`. It will only match a CR before the LF. `sed` replaces on a line-by-line basis, so `\n` is not in scope. To make it available, use the `-z` option with GNU sed, `sed -Ez 's/\r?\n//g'`, or simply `tr -d "\r\n" < file` – Wiktor Stribiżew Feb 25 '20 at 08:12
  • Although @NigelRen suggested tackling the root cause, the `tr -d "\r\n" < file` suggested by @Wiktor was sufficient as a workaround. Thank you both for your quick replies and help. – d_kpr Feb 25 '20 at 14:24
  • `tr` works a character at a time. `tr -d '\r\n'` doesn't mean "remove all `\r\n` strings", it means "remove every `\r` or `\n` character" and some of those might be present in your input in a context other than as a newline so it's not a robust approach to removing newlines. See https://stackoverflow.com/q/45772525/1745001 for some better approaches. – Ed Morton Feb 25 '20 at 14:44

1 Answer


There is no $ in your file. $ is a symbol used to indicate end-of-string in a regular expression (just like ^ means start-of-string). A tool that operates one line at a time treats each line as its string, so the end of the string is also the end of the line; that is why people using line-oriented tools often describe $ as meaning end-of-line. Some tools (e.g. cat -E) also print $ as an end-of-line indicator.

Some terminology/definitions:

  • \r is an escape sequence used in scripts to generate or match the CR (carriage-return) character ^M (control-M), ASCII 13
  • \n is an escape sequence used in scripts to generate or match the LF (line-feed) character ^J (control-J), ASCII 10
  • $ is a regexp meta-character used in scripts to indicate end-of-string (which often is also the end-of-line) and is also used by tools to indicate end-of-line when displaying text.
  • \n (i.e. LF alone) is considered a newline in UNIX
  • \r\n (i.e. CRLF) is considered a newline in DOS (see Why does my tool output overwrite itself and how do I fix it?)
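To see which of these characters a file really contains, you can dump its bytes as numbers. This is a minimal sketch using POSIX od; the helper name byte_codes is made up for illustration.

```shell
# byte_codes is a hypothetical helper: print every input byte as an
# unsigned decimal number (od -An suppresses the offset column).
byte_codes() { od -An -tu1; }

printf 'foo\n'   | byte_codes   # UNIX line ending: the last byte is 10 (LF)
printf 'foo\r\n' | byte_codes   # DOS line ending: the last two bytes are 13 (CR) and 10 (LF)
```

No byte with the value of $ (ASCII 36) appears in either dump.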

So when you do:

$ printf 'foo\n' | cat -vE
foo$

that does not mean there's a $ at the end of foo, it's just cat displaying a $ to show you where the end of the line is. When you do:

$ printf 'foo\r\n' | cat -vE
foo^M$

the ^M (control-M) is explicitly showing you the CR (carriage-return) character generated by \r, but the $ is not explicitly showing you the ^J (control-J), i.e. the LF (line-feed) character generated by the \n. Instead, cat displays a different character, $, to mark the end of the line. If it DID show you ^Js then everything would be concatenated on one line, which would be tough to read. Consider the ease of reading this:

$ printf 'the\nquick\nbrown\nfox\n' | cat -vE
the$
quick$
brown$
fox$

vs if the output was this:

$ printf 'the\nquick\nbrown\nfox\n' | some_other_tool
the^Jquick^Jbrown^Jfox^J

You can never do either of these:

$ printf 'foo\nbar\n' | sed 's/$//' | cat -vE
foo$
bar$

$ printf 'foo\nbar\n' | sed 's/\n//' | cat -vE
foo$
bar$

to remove a LF. sed already consumed the LF when reading the input, and the $ isn't itself the newline character: it's a metacharacter that lets you say in your regexp "match the end of the string", which for sed by default is the end of the line.
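A CR, unlike the LF, is still part of the line that sed sees, so it can be removed when it sits at the end of the line. Here is a minimal sketch, assuming CRLF input; the helper name strip_cr is made up, and the CR is held in a shell variable so the regexp does not depend on sed understanding \r.

```shell
# Store a literal CR; command substitution only strips trailing LFs.
cr=$(printf '\r')

# strip_cr is a hypothetical helper: delete a CR at the end of each line.
# Inside double quotes, \$ reaches sed as the end-of-string metacharacter $.
strip_cr() { sed "s/${cr}\$//"; }

# od -c shows the remaining characters; the LFs are still there, the CRs are not.
printf 'foo\r\nbar\r\n' | strip_cr | od -c
```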

You might ask - if sed consumed the LF when reading the input then why are there LFs at the end of each line of output? The answer is that sed adds a LF to every output line so that what it outputs is a valid POSIX text file (without terminating LFs you do not have a POSIX text file and so what any subsequent tool does with it is undefined behavior).

You can remove LFs, though, if you use a tool that does not read one line at a time. GNU sed has a -z option to read NUL-separated text instead of LF-separated text and in that mode you can remove LF characters:

$ printf 'foo\nbar\n' | sed -z 's/\n//' | cat -vE
foobar$

and now you can see how $ (the end-of-string metacharacter) is different from \n (the escape sequence to match the LF character):

$ printf 'foo\nbar\n' | sed -z 's/$//' | cat -vE
foo$
bar$

$ printf 'foo\nbar\n' | sed -z 's/\n/<LF>/' | cat -vE
foo<LF>bar$

$ printf 'foo\nbar\n' | sed -z 's/$/<EOS>/' | cat -vE
foo$
bar$
<EOS>$

So the quick answer for "how do you remove LFs with sed?" is this with GNU sed:

$ printf 'foo\nbar\n' | sed -z 's/\n//g'
foobar$

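and if you don't have GNU sed (or actually even if you do, since with a POSIX text file, which contains no NULs, the above reads the whole input into memory at once) then you should just use awk. In both cases the $ after foobar is the next shell prompt rather than output: no LF is left to end the line, and there is no cat -vE in the pipeline: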

$ printf 'foo\nbar\n' | awk -v ORS= '1'
foobar$
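If the file came from a DOS/Windows tool, each line may end in CRLF rather than LF alone, and the awk command above would leave the CRs in place. Here is a minimal sketch of one way to drop both, assuming a CR only needs removing when it is the last character of a line; the function name join_lines is made up for illustration.

```shell
# join_lines is a hypothetical helper name, not part of any tool.
# sub() removes a CR at the end of each record; ORS= stops awk from
# re-adding the LF it consumed when it read the record.
join_lines() { awk -v ORS= '{ sub(/\r$/, "") } 1'; }

printf 'foo\r\nbar\r\n' | join_lines | od -c
```

Unlike tr -d '\r\n', this leaves a CR in the middle of a line alone.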
Ed Morton
  • Thank you for both your earlier comment and a very thorough explanation of the background and anatomy of tools at hand. :) – d_kpr Feb 28 '20 at 07:07