
I need to remove some unknown characters and trailing empty lines from a file. It should be simple, and I feel really stupid that I haven't managed it yet.

Here's the file contents (readable):

    136;2014-09-07 13:41:25;2014-09-07 13:41:55
    136;2014-09-07 13:41:55;2014-09-07 13:42:25
    136;2014-09-07 13:42:25;2014-09-07 13:42:55
    (empty line)
    (empty line)

For some reason, the file comes with several unwanted/unknown characters. Here is the hex dump:

    fffe 3100 3300 3600 3b00 3200 3000 3100 3400 2d00 3000 3900  :..1.3.6.;.2.0.1.4.-.0.9.
    2d00 3000 3700 2000 3100 3300 3a00 3400 3100 3a00 3200 3500  :-.0.7. .1.3.:.4.1.:.2.5.
    3b00 3200 3000 3100 3400 2d00 3000 3900 2d00 3000 3700 2000  :;.2.0.1.4.-.0.9.-.0.7. .
    3100 3300 3a00 3400 3100 3a00 3500 3500 0d00 0a00 3100 3300  :1.3.:.4.1.:.5.5.....1.3.
    3600 3b00 3200 3000 3100 3400 2d00 3000 3900 2d00 3000 3700  :6.;.2.0.1.4.-.0.9.-.0.7.
    2000 3100 3300 3a00 3400 3100 3a00 3500 3500 3b00 3200 3000  : .1.3.:.4.1.:.5.5.;.2.0.
    3100 3400 2d00 3000 3900 2d00 3000 3700 2000 3100 3300 3a00  :1.4.-.0.9.-.0.7. .1.3.:.
    3400 3200 3a00 3200 3500 0d00 0a00 3100 3300 3600 3b00 3200  :4.2.:.2.5.....1.3.6.;.2.
    3000 3100 3400 2d00 3000 3900 2d00 3000 3700 2000 3100 3300  :0.1.4.-.0.9.-.0.7. .1.3.
    3a00 3400 3200 3a00 3200 3500 3b00 3200 3000 3100 3400 2d00  ::.4.2.:.2.5.;.2.0.1.4.-.
    3000 3900 2d00 3000 3700 2000 3100 3300 3a00 3400 3200 3a00  :0.9.-.0.7. .1.3.:.4.2.:.
    3500 3500 0d00 0a00 0000 0d00 0a00                           :5.5...........

So, as you can see, the first 2 bytes are 0xFF 0xFE and every character is followed by a 0x00. The line endings are 0D 00 + 0A 00: carriage return and linefeed (\r\n), each padded with a 0x00.

I want to remove those 0x00 bytes, the first 2 bytes (0xFF 0xFE), and the last 4, and convert the CRLF line endings to LF.

I could do that by using head, tail and tr:

    # drop CR and NUL, then strip the 2-byte BOM and the 2 trailing linefeeds
    tr -d '\15\00' < 2014.log | tail -c +3 | head -c -2 > 3.log

The problem is that I'm not sure the files will always arrive exactly like this, so I need to build a more generic method. I ended up with:

    sed 's/\xFF\xFE//g; s/\x00//g; s/\x0D//g' 2014.log > 2.log

or

    tr -d '\377\376\00\15' < 2014.log > 2.log

Now I need to remove the last two empty lines, which, as I said at the beginning, should be easy, but I haven't been able to accomplish it.

I've tried:

    sed '/^\s*$/d'
    sed '/^$/d'
    awk 'NF > 0'
    egrep -v "^$"
    Other stuff

But in the end they remove only one of the blank lines; I still have one 0x0A at the end. I tried to replace the pair 0x0A 0x0A with sed, even using \n\n, but it didn't work. I can't remove every \n because I need the normal lines; I only want to remove linefeeds that appear at least twice in sequence. Again, I could use tail or head to strip them, but that would assume every file arrives this way, and that's not true.

I see this as simple find-and-replace work, but it seems it doesn't work that way when linefeeds are involved.
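
For the record, GNU sed can treat the whole file as one record: the -z option splits input on NUL bytes instead of newlines, so \n can be matched like any other character. A minimal sketch, assuming GNU sed (2.log is the output of the cleanup step above):

    sed -z 's/\n\{2,\}$/\n/' 2.log > 3.log

This replaces a run of two or more trailing newlines with a single one, and leaves files that were already clean untouched.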

For information purposes:

    file -i 2014-09-07-13-46-51.log
    2014-09-07-13-46-51.log: application/octet-stream; charset=binary

It's not recognized as a text file... this file is extracted from a Flash shared object (.sol).

Since new files may not be like this and may arrive as normal text files, I can't simply cut bytes off every file; I need to treat only the problematic ones.
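
A minimal detection sketch, assuming the only broken variant is this UTF-16 one (filenames are illustrative): check for the FF FE byte order mark and only convert files that carry it.

    if [ "$(head -c 2 2014.log | od -An -tx1 | tr -d ' ')" = "fffe" ]; then
        # UTF-16 with BOM: decode, then drop CRs, stray NULs and blank lines
        iconv -f UTF-16 -t UTF-8 2014.log | tr -d '\r\0' | tr -s '\n' > 2014.clean.log
    else
        cp 2014.log 2014.clean.log
    fi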

    That looks like UTF-16 with a BOM. Try opening the file in something that can handle that encoding. Then see if you can convert it to a better encoding for your purposes. – Etan Reisner Sep 09 '14 at 15:21
  • I think you're right, it seems to be a UTF-16 file with a BOM. I tried to convert it first with iconv -f UTF-16 -t UTF-8; it removed those first bytes and the 00s, but the last bytes get garbled, maybe the file comes corrupted: 0d 0a00 0d0a – Luciano Serra Sep 09 '14 at 15:34
  • What is the "corruption" exactly? There does appear to be a stray `NUL` character in there that could be throwing things off. Recreating that file here seems to convert correctly, but there is a stray `NUL` byte on the last line. – Etan Reisner Sep 09 '14 at 15:48
  • Yes, there's a NUL (x00) between the last two CRLF - 0d0a 00 0d0a - no problem with that, I just need to remove it all from the file, empty lines plus this nul, and the last linefeed – Luciano Serra Sep 09 '14 at 16:06
  • It is simple enough to post-process the converted file to drop that set of three bytes from the end of the file. – Etan Reisner Sep 09 '14 at 16:11
  • Just to make it clearer: I have thousands of these log files that I need to import and can't lose, so I can't assume that all new files are going to be the same. That's why I'm trying to build a method that won't blindly modify files. – Luciano Serra Sep 09 '14 at 16:12
  • I didn't see your last answer before I posted mine. The problem is that new log files may come 'corrected', I mean, with the right encoding and no trailing NUL chars or wrong linefeeds (it should have been like this since the beginning, but unfortunately I already have thousands of those files to import). I can't cut the last bytes right away. Give me a minute, I will try to remove them and give you feedback. – Luciano Serra Sep 09 '14 at 16:15
  • I'm sorry, my rep is low yet, so I have to post here. As the original question, it would be easy if I could just find-replace those last bytes. I tried again with sed: iconv -f UTF-16 -t UTF-8 2014.log | sed 's/\x0d\x0a\x00\x0d\x0a//g' > 4.log - but again, it doesn't seem to work with x0a - \n – Luciano Serra Sep 09 '14 at 16:31
  • I managed it, but I didn't like the solution... well, here it is: I convert linefeeds to another character with tr, then remove the doubled ones (those that appear more than once in sequence), and then convert them back: tr '\n' '|' | sed 's/||//g;' | sed 's/|/\x0A/g' – Luciano Serra Sep 09 '14 at 16:42
  • sed operates on a line basis, so it can't handle operations on newlines like that. Use a tool that can: awk with an appropriate setting of `RS`, or `ed`, or just about any programming language (see the sketch below). – Etan Reisner Sep 09 '14 at 17:24
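
A sketch of the awk route from that last comment, assuming GNU awk: setting RS to a regex that never matches is a common gawk idiom for slurping the whole file into one record, after which the trailing newlines can be trimmed in a single substitution.

    gawk 'BEGIN { RS = "^$" }   # regex that never matches: the whole file becomes one record
          { sub(/\n+$/, "\n"); printf "%s", $0 }' 2.log > 3.log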

5 Answers


The "fffe" at the beginning of the file is a byte order mark (http://en.wikipedia.org/wiki/Byte_order_mark) and for me an indication that you have a unicode type file. In that kind of file 'normal' ascii characters are represented by 2 bytes.

In another Stack Overflow question/answer the file is first converted to UTF-8... (grepping binary files and UTF16)
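
That conversion is a one-liner with iconv (the byte order mark lets it work out the byte order); a minimal sketch with illustrative filenames:

    iconv -f UTF-16 -t UTF-8 2014.log > 2014.utf8.log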

Eddy
  • Thank you for the info about the byte order mark! As I told Etan in the other comment, I ran iconv to convert it: iconv -f UTF-16 -t UTF-8. The file is more 'readable', although it won't open properly in some editors because the ending bytes get garbled: 0d 0a00 0d0a – Luciano Serra Sep 09 '14 at 15:38

I finally made it, but I really didn't like the solution. I replaced all linefeeds with another character, like a pipe (|), then removed them when I found two in sequence (||), and then converted the pipes (|) back to \n:

    sed 's/\xFF\xFE//g; s/\x00//g; s/\x0D//g' 2014.log | tr '\n' '|' | sed 's/||//g;' | sed 's/|/\x0A/g' > 5.log

-- @Luciano


Wow, I solved the problem back then but forgot to answer, so here it is!

Using only the tr command I could accomplish it like this:

    # \377\376 = the 0xFF 0xFE BOM, \015 = CR, \000 = NUL, \277\003 = other stray bytes; tr -s '\n' squeezes the empty lines
    tr -d '\377\376\015\000\277\003' < logs.csv | tr -s '\n'

tr removed all the unwanted characters and empty lines, and it was really, really fast, much faster than the options using sed and awk.

Luciano Serra

If you just want the ASCII characters out of the file, you might try iconv.

You can probably identify the file's encoding with file -i.
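
A minimal sketch of that route, assuming GNU iconv (-c silently discards characters that cannot be represented in the target encoding; filenames are illustrative):

    file -i 2014.log
    iconv -c -f UTF-16 -t ASCII 2014.log > 2014.ascii.log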

dawg
  • file -i 2014-09-07-13-46-51.log gives 2014-09-07-13-46-51.log: application/octet-stream; charset=binary. The problem is that those files are extracted from a Flash shared object (.sol), and they arrive corrupted already – Luciano Serra Sep 09 '14 at 15:54

I know you asked for sed, tr or awk, but on the off chance it will change your mind, this is how easy it is to get Perl to do the heavy lifting:

    perl -e 'open my $fh, "<:encoding(utf16)", $ARGV[0] or die "Error reading $ARGV[0]: $!"; while (<$fh>) { s{\x0d\x0a}{\n}g; s{\x00\n}{}g; print $_; }' input_filename
Tim
  • I'll give it a try tomorrow and give you feedback! I may have expressed myself badly; I don't necessarily need to use those 3 commands, I just need something that does the job on Debian – Luciano Serra Sep 09 '14 at 23:12