1

I have a file that is read by applications on both Unix and Windows. However, I run into problems when reading it on Windows because of ^M characters in the middle of the data. I only want to remove the ^M characters that appear in the middle of a line, such as those in field 4 and field 5.

I have tried perl -pe 's/\cM\cJ?//g', but it joins everything into one line, which I don't want. I want each record to stay on its own line, with only the extra ^M characters removed.

# Comment^M
# field1_header|field2_header|field3_header|field4_header|field5_header|field6_header^M
#^M
field1|field2|field3|fie^Mld4|fiel^Md5|field6^M
^M

Thanks

jamessan
5hak3y

3 Answers

1

To just remove CR in the middle of a line:

perl -pe 's/\r(?!\n)//g'

You can also write this as perl -pe 's/\cM(?!\cJ)//g'. The (?!...) construct is a negative look-ahead assertion: the pattern matches a CR, but only when it is not immediately followed by an LF. The look-ahead itself consumes nothing, so the LF is never removed.

Of course, if producing a file with unix newlines is acceptable, you can simply strip all CR characters:

perl -pe 'tr/\015//d'

What you wrote, s/\cM\cJ?//g, strips a CR and the LF after it if there is one, because the LF is part of the matched pattern.

Gilles 'SO- stop being evil'
  • Don't use octal, much too confusing and error prone ... which you prove yourself by using `\010` instead of `\012`. Use `\r` and `\n`, much clearer. – mscha May 22 '11 at 19:22
  • @mscha: `\n` and company are subtly different: they mean “whatever is CR on this platform”, not a particular byte value. This is admittedly mostly theoretical, since `\r` is ASCII CR and `\n` is ASCII LF on both Windows and unix, and other platforms are highly marginal. – Gilles 'SO- stop being evil' May 22 '11 at 19:45
  • On Windows a "\n" is "\015\012". On Unix/Linux a "\n" is "\012". You should change your post and remove "\n". – David Raab May 23 '11 at 10:08
  • @Sid: No, `\n` is just `\012` on Windows. (Try `print unpack("H*", "\n")`.) What happens on a Windows perl is that *when reading or writing a file* that's opened as text (such as STDIN/STDOUT), a CRLF (`\015\012`) newline is converted from/to `\012`. So on a Windows perl, `-pe 'tr/\r//d'` and `-pe 's/\r(?!\n)//g'` have the same effect (they wouldn't if the file had been opened in binary mode). – Gilles 'SO- stop being evil' May 23 '11 at 17:47
0

Sounds like the easiest solution might be to check your filetype before moving between unix and windows. dos2unix and unix2dos might be what you really need, instead of a regex.

I'm not sure what character ^M is supposed to be, but carriage return is \015 or \r. So, s/\r//g should suffice. Remember it also removes your last carriage return, if that is something you wish to preserve.

TLP
0
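use strict;
use warnings;

# "\r" is the carriage return that editors display as ^M.
my $line = "field1|field2|field3|fie\rld4|fiel\rd5|field6\r";

# Remove every CR except the one at the end of the line.
$line =~ s/\r(?!$)//g;

print $line;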
Tudor Constantin
  • Did you try your code and look at the output? It's removing the character after the ^M too. – ysth May 22 '11 at 16:24
  • @ysth - you were right, thanks. I have edited the response and added a negative lookahead for the end of line (it now removes only a ^M that is not followed by an EOL) – Tudor Constantin May 23 '11 at 03:54