Identify and remove specific hidden characters from text file

Question

I have a text file that contains several hidden characters. Using cat -v I am able to see that they include the following:

^M

^[[A

There are also \n characters at the end of the line. I would like to be able to display these as well somehow.

Then I would like to be able to selectively cut and sed these hidden characters. How would I go about accomplishing this?

I've tried dos2unix, but that didn't remove any of the ^M characters. I've also tried sed s/^M//g, where I entered the ^M by pressing ctrl+v m.

Raw data

Output from cat -v on the raw data, also available at: http://pastebin.com/Vk2i81JC

^MCopying non-tried blocks... Pass 1 (forwards)^M^[[A^[[A^[[Arescued:         0 B,  errsize:       0 B,  current rate:        0 B/s
   ipos:         0 B,   errors:       0,    average rate:        0 B/s
   opos:         0 B, run time:       1 s,  successful read:       1 s ago
^MFinished

Output wanted

Also available at: http://pastebin.com/wfDnrELm

rescued:         0 B,  errsize:       0 B,  current rate:        0 B/s
   ipos:         0 B,   errors:       0,    average rate:        0 B/s
   opos:         0 B, run time:       1 s,  successful read:       1 s ago
Finished

Using sed on ^M will fail if you only work line by line, because it is part of the line separator. You need to load at least two lines with `N` (one possible method) before using your `s` command. In this case I would load the whole file into the working buffer before running the `s` command once. — NeronLeVelu, Sep 11 '14 at 06:01
The `^[[A` sequences are probably terminal control codes. Removing them is complex; there can be varying numbers of characters after the initial `^[` (escape) character. — Jonathan Leffler, Sep 11 '14 at 14:47
Are you on Windows or Unix? Particularly with the control-M characters, it matters — on Windows, control-M is an important part of the line ending in normal text files (two characters, control-M or CR and control-J or LF mark the end of a line). On Unix, the CR characters are far less important, not being required except by Internet standard protocols or Windows compatibility. — Jonathan Leffler, Sep 11 '14 at 14:52
@Jonathan Leffler Linux. The text file was generated via a `ddrescue` logfile. — bmikolaj, Sep 11 '14 at 14:53
That makes life a whole lot easier! You mention in a [comment](http://stackoverflow.com/questions/25778587/?noredirect=1#comment40335524_25779386) that you want to remove text surrounded by `^M`. Can you give more context? Are these within a line, or do they span lines? `sed` is line-based; `tr` is character-based; if you need cross-line matching, you probably need to move to Perl or Python. With luck, you won't need to do that. — Jonathan Leffler, Sep 11 '14 at 14:56
In another [comment](http://stackoverflow.com/questions/25778587/identify-and-remove-specific-hidden-characters-from-text-file?noredirect=1#comment40337736_25779386) I link to examples. The `^M` surrounds "Copying non-tried blocks..." — bmikolaj, Sep 11 '14 at 15:09

Ram · Accepted Answer · 2014-09-11T19:02:21.177

8

Try the tr command below, which translates or deletes characters. With -c and -d it deletes every character except those listed, in octal, within the quotes.

Octal \12 is newline (\n), octal \11 is tab (^I), and octal \40-\176 are the printable ASCII characters (space through ~), which are kept.

For a complete reference of octal values refer to this page: https://courses.engr.illinois.edu/ece390/books/labmanual/ascii-code-table.html

tr -cd '\11\12\40-\176' < org.txt > new.txt

The file new.txt will contain the text with the unwanted characters removed.
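A minimal sketch of the filter in action, which also covers making the hidden bytes visible. The sample string is made up for illustration; `od -c` prints each byte, so `\r`, `\n`, `\t` and the escape byte (033) show up explicitly.

```shell
# Build a small made-up sample: CR, ESC [ A, a tab, and a trailing newline.
printf '\rFinished\033[A\tok\n' > org.txt

# Show every byte, including \r, \n, \t and ESC (033), before filtering.
od -c org.txt

# Keep only tab, newline, and printable ASCII; delete everything else.
tr -cd '\11\12\40-\176' < org.txt > new.txt

# Inspect the result: the CR and ESC bytes are gone, the tab and newline remain.
od -c new.txt
```

Note that the printable `[A` that followed the escape byte is still present after this step, because `tr` only deletes the listed control bytes.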

To remove the text between two ^M characters and also remove the unnecessary control characters, use the command below:

sed "s/\r.*\r//g" org.txt | tr -cd '\11\12\40-\176' > new.txt
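On the sample from the question, the `tr` step deletes the escape byte but not the printable `[A` after it, so the pipeline above would likely leave `[A[A[A` in front of `rescued:`. The following is a sketch of one way to handle that, assuming GNU sed (for the `\r` and `\x1b` escapes) and assuming the escape sequences are all of the `ESC [ ... letter` form seen in the sample. The sample input is shortened from the question's data.

```shell
# Shortened sample resembling the first and last lines of the ddrescue log.
printf '\rCopying non-tried blocks... Pass 1 (forwards)\r\033[A\033[A\033[Arescued: 0 B\n\rFinished\n' > org.txt

# 1. Delete text enclosed between two carriage returns on the same line.
# 2. Delete escape sequences of the form ESC [ parameters final-letter.
# 3. Drop remaining carriage returns and any other control bytes.
sed -e 's/\r.*\r//g' -e 's/\x1b\[[0-9;?]*[A-Za-z]//g' org.txt |
  tr -cd '\11\12\40-\176' > new.txt

cat new.txt
```

Because `.*` is greedy, a line containing more than two `^M` characters loses everything between the first and the last one, which matches the single-pair case shown in the question but may need adjusting for other logs.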

edited Sep 11 '14 at 19:02

answered Sep 11 '14 at 04:59


This worked fairly well. I can now remove parts of the file I don't want, but I would still like to be able to cut using `^M` as a delimiter. I have text surrounded by two `^M` that I would like to remove as well. – bmikolaj Sep 11 '14 at 14:09
Can you give an example file or sample input and expected output? – Ram Sep 11 '14 at 14:12
I'm curious that you think `\11` (aka decimal 9, or control-I, or tab) is called 'form feed'; that is normally `\14` (aka decimal 12, control-L, or form feed). I was going to write that I wouldn't choose to keep form feed and would choose to keep tab, but then realized that your code actually does keep tab. – Jonathan Leffler Sep 11 '14 at 14:49
Here is a snippet example generated via `cat -v`: http://pastebin.com/Vk2i81JC And here is how it should look and how it appears via `cat`: http://pastebin.com/wfDnrELm – bmikolaj Sep 11 '14 at 14:59
@Jonathan Leffler Yes, you are right: \11 is tab and \14 is form feed. I will correct the answer, and thanks for pointing that out. – Ram Sep 11 '14 at 15:51
@p014k Please try the command sed "s/\r.*\r//g" org.txt | tr -cd '\11\12\40-\176' > new.txt. The sed command first tries to remove the pattern ^Masbdad^M, and then the tr command removes all the other unnecessary control characters. – Ram Sep 11 '14 at 15:58
@pO14k I have updated the answer to work as per your requirements. – Ram Sep 11 '14 at 19:04
