I have some files that show the character in the image below when I open them in gedit. How can I remove it using the sed command?

This is the output of the command `hexdump -Cv character`:

00000000  0c 0a                                             |..|
00000002

enter image description here

Miguel Silva
  • `hexdump -C` will give you the ASCII code. – jww Oct 30 '19 at 00:19
  • First find out what character that is. If it is in a file, use `hexdump -Cv filename`; if not, use `echo "yourchar" | hexdump -Cv`. Then update your post with the actual value of the character (e.g. `0xF3`). – David C. Rankin Oct 30 '19 at 00:19
  • @DavidC.Rankin, I added the output of the command to the post. – Miguel Silva Oct 30 '19 at 00:30
  • Looks like [Stripping hex bytes with sed - no match](https://stackoverflow.com/questions/3435370/stripping-hex-bytes-with-sed-no-match) will get you fixed right up. – David C. Rankin Oct 30 '19 at 00:41
  • I'm not sure where that char comes from. Must be UTF-16 or something. It isn't ASCII (it would be equivalent to `(np)`, which is char `12` (`0x0c`), followed by a newline). That may be difficult for `sed` to handle, since `sed` is a stream editor that considers `'\n'` (`0x0a`) the end of a record and not a character within the record itself. To delete just `0c 0a` you could do something like `sed ':a;s/[\x0c]//;{N;s/\n//;ba}' file`. See: [sed join lines together](https://stackoverflow.com/questions/7852132/sed-join-lines-together) – David C. Rankin Oct 30 '19 at 00:53
  • Yeah, that works! You can use `sed -i 's/\f//g' file` too; it works for me. Thanks @DavidC.Rankin – Miguel Silva Oct 30 '19 at 00:59
  • Cool, the other one I came up with was `sed ':a;/[\x0c]$/{s/[\x0c]//;N;s/\n//;ba}' file`, but I like `'s/\f//g'` much better (but I can't get that to work here). Or, it works without branching as well, e.g. `sed '/[\x0c]$/{s/[\x0c]//;N;s/\n//}' file` – David C. Rankin Oct 30 '19 at 01:04
  • @David - I was thinking the `0xc 0xa` is a `CRLF` gone sideways for some reason. `CRLF` is `0xd 0xa`. Something like [Text file with 0D 0D 0A line breaks](https://stackoverflow.com/q/6998506/608639), but different. – jww Oct 30 '19 at 01:08
  • Your guess is better than mine. I don't know that I've seen `0xc` in a text file since having to put a page break in a flat file on a VAX back in the 80s to leave room to cut (with scissors) and paste (with tape) a graph on that part of the page before copying `:)` – David C. Rankin Oct 30 '19 at 01:11
  • I'm using sed 4.4 on Ubuntu 18.04, and `'s/\f//g'` works! :) Yeah, the character is a page break (`\f`); it comes from converting a PDF file to TXT with the `pdftotext` command. – Miguel Silva Oct 30 '19 at 01:28
  • When I use `'s/\f//g'` the `0xc` is removed, but the `0xa` that follows remains. If that isn't a problem, you are good. If you need to remove both (and only when they occur together), then something similar to what is above will fix you up. Good luck with your scripting. – David C. Rankin Oct 30 '19 at 01:34
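
The two variants discussed in the comments (remove only the form feed, or remove the form feed together with the newline that follows it) can be compared side by side. What follows is a minimal sketch, assuming a POSIX shell; the sample file and the variable names (`sample`, `ff`, `only_ff`, `with_tr`, `both`) are made up for illustration and do not come from the original post. It keeps a literal form feed in a shell variable so the `sed` scripts do not rely on `\f` or `\x0c` escape support, which may vary between `sed` implementations.

```shell
# Hypothetical sample input: "page one", then a line holding only a
# form feed (0x0c), then "page two".
sample=$(mktemp)
printf 'page one\n\f\npage two\n' > "$sample"

# Store a literal form feed so the sed scripts need no \f or \x0c escapes.
ff=$(printf '\f')

# Variant 1: delete every form feed. The newline that followed it stays,
# so an empty line is left behind between the two pages.
only_ff=$(sed "s/${ff}//g" "$sample")

# Same result with tr, which understands '\f' on its own.
with_tr=$(tr -d '\f' < "$sample")

# Variant 2: on lines ending in a form feed, delete it, append the next
# line with N, and remove the newline that joined them.
both=$(sed "/${ff}\$/{s/${ff}//;N;s/\n//;}" "$sample")

# Show the bytes of each result for comparison.
printf '%s\n' "$only_ff" | od -c
printf '%s\n' "$both" | od -c

rm -f "$sample"
```

Variant 1 is enough when a leftover empty line is acceptable. Variant 2 follows the non-branching command from the comments and, like it, handles one form feed line at a time; several consecutive form feed lines would need the looping form with `:a` and `ba`.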
