I need to remove some unknown characters and the remaining empty lines from a file. It should be simple, and I feel really stupid that I haven't managed it yet.
Here are the file contents (readable):
136;2014-09-07 13:41:25;2014-09-07 13:41:55
136;2014-09-07 13:41:55;2014-09-07 13:42:25
136;2014-09-07 13:42:25;2014-09-07 13:42:55
(empty line)
(empty line)
For some reason, this file comes with several unwanted/unknown chars. The hex dump is:
fffe 3100 3300 3600 3b00 3200 3000 3100 3400 2d00 3000 3900 :..1.3.6.;.2.0.1.4.-.0.9.
2d00 3000 3700 2000 3100 3300 3a00 3400 3100 3a00 3200 3500 :-.0.7. .1.3.:.4.1.:.2.5.
3b00 3200 3000 3100 3400 2d00 3000 3900 2d00 3000 3700 2000 :;.2.0.1.4.-.0.9.-.0.7. .
3100 3300 3a00 3400 3100 3a00 3500 3500 0d00 0a00 3100 3300 :1.3.:.4.1.:.5.5.....1.3.
3600 3b00 3200 3000 3100 3400 2d00 3000 3900 2d00 3000 3700 :6.;.2.0.1.4.-.0.9.-.0.7.
2000 3100 3300 3a00 3400 3100 3a00 3500 3500 3b00 3200 3000 : .1.3.:.4.1.:.5.5.;.2.0.
3100 3400 2d00 3000 3900 2d00 3000 3700 2000 3100 3300 3a00 :1.4.-.0.9.-.0.7. .1.3.:.
3400 3200 3a00 3200 3500 0d00 0a00 3100 3300 3600 3b00 3200 :4.2.:.2.5.....1.3.6.;.2.
3000 3100 3400 2d00 3000 3900 2d00 3000 3700 2000 3100 3300 :0.1.4.-.0.9.-.0.7. .1.3.
3a00 3400 3200 3a00 3200 3500 3b00 3200 3000 3100 3400 2d00 ::.4.2.:.2.5.;.2.0.1.4.-.
3000 3900 2d00 3000 3700 2000 3100 3300 3a00 3400 3200 3a00 :0.9.-.0.7. .1.3.:.4.2.:.
3500 3500 0d00 0a00 0000 0d00 0a00 :5.5...........
So, as you can see, the first 2 bytes are xFF and xFE, and there is an x00 after each char. The line endings are 0D00 + 0A00, i.e. carriage return and linefeed (\r\n), each followed by an x00.
I wanted to remove those x00, the first 2 bytes (xFF xFE) and the last 4, and convert the CRLF to LF.
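For anyone who wants to reproduce this, a file with the same byte layout can be built roughly as follows. This is only a minimal sketch: `sample.log` and `make_sample` are made-up names, and the record is shortened to `136;x` instead of the real data. The layout (FF FE, then two bytes per character) looks like what a UTF-16 little-endian file with a byte order mark would contain, but that is an assumption about the export, not something confirmed.

```shell
# Build a small file with the same layout as the dump above:
# FF FE, each character followed by 00, CRLF line endings,
# and the trailing 00 00 0D 00 0A 00.
make_sample() {
    printf '\377\376'                 # the two leading bytes FF FE
    printf '%s\000' 1 3 6 ';' x       # "136;x", each char followed by 00
    printf '\r\000\n\000'             # CR LF, each followed by 00
    printf '\000\000\r\000\n\000'     # the tail seen at the end of the dump
}

make_sample > sample.log
od -An -tx1 sample.log                # inspect the bytes
```
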
I could do that by using head, tail and tr:
tr -d '\15\00' < 2014.log | tail -c +3 | head -c -2 > 3.log
The problem is, I'm not sure if the file will always arrive like this, so I need to build a more generic method. I ended up with:
sed 's/\xFF\xFE//g; s/\x00//g; s/\x0D//g' 2014.log > 2.log
or
tr -d '\377\376\00\15' < 2014.log > 2.log
Now I need to remove the last two empty lines which, as I said at the beginning, should be easy, but I can't manage it.
I've tried:
sed '/^\s*$/d'
sed '/^$/d'
awk 'NF > 0'
egrep -v "^$"
Other stuff
But in the end it removes only one of the blank lines; I still have one x0A at the end. I tried to replace the two consecutive x0A x0A with sed, even using \n\n, but it didn't work.
I can't remove all \n because I need the normal lines; I just want to remove them when they appear at least twice in a row. Again, I could use tail or head to remove it, but I would be assuming that all files arrive that way, and that's not true.
I see it as a simple find and replace, but it seems it doesn't work that way when we are working with linefeeds.
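One likely reason the \n\n replacement never matches is that sed reads its input one line at a time with the newline stripped, so by default the pattern space never contains two linefeeds. The x0A left after deleting the empty lines may also simply be the terminator of the last real line. A minimal sketch of dropping only the trailing blank lines, assuming the x00 and x0D bytes are already gone (`strip_trailing_blank` is a hypothetical name):

```shell
# Hold back blank lines and print them only if a non-blank line follows,
# so blank lines at the very end of the input are never written out.
strip_trailing_blank() {
    awk '
        NF == 0 { pending++; next }       # blank line: remember it, print nothing yet
        {
            while (pending > 0) {         # a real line follows: release held blanks
                print ""
                pending--
            }
            print
        }
    '
}

# Stand-in for the cleaned file: two records followed by two empty lines.
printf '136;a\n136;b\n\n\n' | strip_trailing_blank | od -c
```

Blank lines in the middle of the file are kept; only the run at the end is dropped, and the last line still ends with a single linefeed.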
For information purposes:
file -i 2014-09-07-13-46-51.log
2014-09-07-13-46-51.log: application/octet-stream; charset=binary
It's not being recognized as a text file... this file is extracted from a Flash shared object (.sol).
As the new files may not be like this and may arrive as normal text files, I can't simply cut the files, but I need to handle the problematic ones.
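What I have in mind is something along these lines: look at the first two bytes and only clean the file when they are FF FE. This is a rough sketch, not a tested solution. `clean_log` is a made-up helper, it assumes the content is plain ASCII (deleting bytes like this would mangle anything else), and it relies on `head -c`, which is common but not guaranteed everywhere.

```shell
# Hypothetical helper: clean a file only if it starts with FF FE,
# otherwise pass it through untouched.
clean_log() {
    # First two bytes as a hex string, e.g. "fffe"
    first=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$first" = "fffe" ]; then
        # Drop FF, FE, the 00 padding and the CR bytes, then the empty lines
        tr -d '\377\376\000\015' < "$1" | sed '/^$/d'
    else
        cat "$1"
    fi
}
```

It would be called as `clean_log 2014.log > 2.log`.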