How do I remove all non-standard characters from a file?

Question

I had this same problem several weeks ago in bash, but now I would like a solution in python.

My input looks like this:

^MCopying non-tried blocks... Pass 1 (forwards)^M^[[A^[[A^[[Arescued:         0 B,  errsize:       0 B,  current rate:        0 B/s
   ipos:         0 B,   errors:       0,    average rate:        0 B/s
   opos:         0 B, run time:       1 s,  successful read:       1 s ago
^MFinished

I would like to remove every ^M control character and every ^[[A sequence to achieve the following desired output:

rescued:         0 B,  errsize:       0 B,  current rate:        0 B/s
   ipos:         0 B,   errors:       0,    average rate:        0 B/s
   opos:         0 B, run time:       1 s,  successful read:       1 s ago
Finished

Thus far I've tried:

def main(input=None):
    f = open(os.path.abspath(input),'r')
    file = f.read()
    f.close()
    filter(lambda x: x in string.printable, file)
    open('output', 'w').write(file)

but doing a cat -v still shows all the non-standard characters.

Using itertools.ifilter produces the same result.

See also: http://stackoverflow.com/questions/14693701/how-can-i-remove-the-ansi-escape-sequences-from-a-string-in-python — Warren Weckesser, Oct 13 '14 at 23:45
Your question should contain enough information in the question to explain what you're trying to do; just linking to a different question isn't sufficient. I've edited your question so that it (hopefully) matches what you're actually asking; if I'm wrong, please reject my edit and do it yourself. — abarnert, Oct 13 '14 at 23:59
In your updated version, how are you expecting to remove the "Copying non-tried blocks... Pass 1 (forwards)" part? That's obviously all printable characters. Are you trying to actually simulate a terminal, so you can detect that the initial line is getting overwritten later and therefore remove it? — abarnert, Oct 14 '14 at 00:03
I'm planning on handling that string after the other characters. I suppose I could handle it all in one go. — bmikolaj, Oct 14 '14 at 00:08
@p014k: If you're just trying to do something this narrowly special-purpose, why not just remove everything before `'rescued'`, remove the character right before `'Finished'`, and be done with it? — abarnert, Oct 14 '14 at 00:11

score 1 · Answer 1 · edited May 23 '17 at 11:57

If what you want to do is remove carriage returns (^M, or '\r' in Python terms) and complete ANSI or VT100 or whatever-you-have control sequences, filtering on string.printable is not going to do what you want. (You're also doing it wrong, as Warren Weckesser's answer explained—filter doesn't modify the string in-place, it returns a new string—and overcomplicating it a bit, but given that it's not the right logic, who cares?)

If you look at string.printable, you'll see that it contains carriage returns:

>>> '\r' in string.printable
True

So, stripping non-printable characters won't remove carriage returns.

And if you look at what your control sequences look like, like ^[[A ('\x1b[A' in Python terms), they start with an Escape character, and are then followed by a sequence of printable characters:

>>> [c.isprintable() for c in '\x1b[A']
[False, True, True]

So, when you strip out non-printable characters, that's going to remove the escape character, leaving behind the [ and A.
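To see both effects at once, here is a minimal sketch that filters a string on string.printable the way the question attempts to. The sample string is made up for illustration; it is not the asker's actual data.

```python
import string

# Made-up sample: a carriage return, some text, then one ^[[A sequence.
sample = '\rFinished\x1b[A'

# Keep only the characters that appear in string.printable.
filtered = ''.join(c for c in sample if c in string.printable)

# The '\r' survives (it is "printable"), the escape character is dropped,
# and the '[A' that followed it is left behind as ordinary text.
print(repr(filtered))  # '\rFinished[A'
```

So the filter leaves the carriage return in place and turns each control sequence into stray printable characters.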

So, you need to write or find some code that parses control sequences so you can detect them and remove them. Which means you need to know what kind of control sequences you're trying to detect and remove.

IIRC, the rule for both VT100 and the obsolete ANSI X3.64 is pretty simple, something like this:

Escape (^[, aka \x1b)
Optionally [, followed by a sequence of "private" characters, followed by a sequence of zero or more semicolon-separated integers, followed by zero or more "intermediate" bytes (from ASCII 32-47)… which I think might be simpler to just match as a [ followed by any string of characters from ASCII 32-63 except for 58 than to try to get exactly right.
A "command" (from ASCII 64-126).

So, a regex like r'\x1b\[[ -9;-?]*[@-~]' should handle that. But since I don't know whether your data are VT100, ANSI X3.64, or "whatever happened to be in the termcaps at the time I ran some program", I can't tell you whether that's the right rule for you. All I can tell you is that this rule will work for the one example you gave, ^[[A.
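A rough sketch of how that regex could be applied, assuming the data really do follow the rule described above. The names CSI_RE and strip_control are made up for this example; they are not part of any library.

```python
import re

# Escape, '[', any run of bytes from ASCII 32-63 except ':' (58),
# then one "command" byte from ASCII 64-126.
CSI_RE = re.compile(r'\x1b\[[ -9;-?]*[@-~]')

def strip_control(text):
    """Remove matching control sequences, then carriage returns."""
    # Remove whole sequences first so no '[A' debris is left behind.
    without_sequences = CSI_RE.sub('', text)
    return without_sequences.replace('\r', '')

# Made-up sample modelled on the start of the input in the question.
sample = '\rCopying\r\x1b[A\x1b[A\x1b[Arescued: 0 B'
print(repr(strip_control(sample)))  # 'Copyingrescued: 0 B'
```

One caveat if you read the file on Python 3: text mode with the default newline handling translates a lone '\r' to '\n' before your code sees it, so open the file with newline='' (or in binary mode) if the carriage returns need to be detected and removed.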

answered Oct 13 '14 at 23:42

abarnert

Sorry, I deleted my answer. It didn't answer the real question. – Warren Weckesser Oct 13 '14 at 23:46
I am not set on using `filter`, I just want a solution that works in python that will give me the [output desired](http://pastebin.com/raw.php?i=wfDnrELm) from [the input](http://pastebin.com/raw.php?i=Vk2i81JC). The `filter` method is just something I tried that I thought worked, but upon using `cat -v`, those characters were still there. I mentioned in Warren's deleted answer that I'd like to remove the `^MCopying non-tried blocks... Pass 1 (forwards)^M^[[A^[[A^[[A` part. – bmikolaj Oct 13 '14 at 23:46
@p014k: Why are you responding to an irrelevant footnote instead of to the main point of the answer? If you have the wrong logic, it doesn't matter how you implement that logic, it's not going to work. But I'll remove that part if it's too distracting. – abarnert Oct 13 '14 at 23:50
Meanwhile, why did you downvote my answer? This is definitely your problem, and definitely the way to solve it. It doesn't contain complete code because the problem isn't fully specified, but this should be enough for you to write your own code to match your actual data. – abarnert Oct 13 '14 at 23:54
I didn't downvote your answer, someone else did. Your continuous editing is the only confusing part. I will try removing \x1b via `re` module. Are `^M` and `\r` equivalent in python? – bmikolaj Oct 13 '14 at 23:59
@p014k: I don't think you're getting the point. "Removing `\x1b` via `re` module" is going to have the exact same effect as removing it the way you're already doing it—it's going to leave the `[A` behind. You don't want that, do you? – abarnert Oct 14 '14 at 00:00
@p014k: Meanwhile, `^M` and `'\r'` aren't equivalent in Python, because `^M` doesn't mean anything in Python; it's just two normal characters. But `^M` in `cat` is equivalent to `\r` in Python—it's how each one displays the carriage return characters. – abarnert Oct 14 '14 at 00:01
Sorry. Meant to say removing `\x1b[A` – bmikolaj Oct 14 '14 at 00:02
@p014k: I think I've got the exact rule for VT100 and ANSI slightly wrong, so scrap my existing regex; I've edited in a new one which should be simpler and probably still good enough. – abarnert Oct 14 '14 at 00:07

score 1 · Accepted Answer · answered Oct 14 '14 at 00:43

If you're not actually trying to remove all control sequences, just the specific ^M and ^[[A sequences from that specific input, you can do that in two simpler ways.

First, just replace those sequences:

text = text.replace('\r', '').replace('\x1b[A', '')

Or, second—which seems more complicated, but it lets you take care of the other part you haven't gotten to yet (removing all the printable stuff between the first two ^Ms)—you could just remove everything before 'rescued', then remove the character right before 'Finished':

# partition on the first 'rescued', drop the prefix, re-join the rest
text = ''.join(text.partition('rescued')[1:])
# partition on the last 'Finished', drop the last char of the prefix, re-join
bits = text.rpartition('Finished')
text = ''.join([bits[0][:-1], bits[1], bits[2]])

Or, with a regular expression:

text = ''.join(re.search(r'(rescued.*?)\r(Finished.*)', text, re.DOTALL).groups())

The (rescued.*?) matches everything from rescued up to but not including the next \r, then the (Finished.*) matches everything after that from Finished to the end (I'm not sure whether that's nothing, or a newline); join those two capture groups together, and you've got what you wanted.
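For comparison, here is a self-contained sketch of the partition and regex variants run against a shortened, made-up stand-in for the input in the question. The function names by_partition and by_regex are hypothetical, and the None check in by_regex is an addition for safety, not part of the one-liner above.

```python
import re

# Shortened stand-in for the question's input: '\r' is what cat -v
# shows as ^M, and '\x1b[A' is what it shows as ^[[A.
sample = (
    '\rCopying non-tried blocks... Pass 1 (forwards)\r\x1b[A\x1b[A\x1b[A'
    'rescued:         0 B,  errsize:       0 B\n'
    '   opos:         0 B, run time:       1 s\n'
    '\rFinished\n'
)

def by_partition(text):
    # Keep everything from the first 'rescued' onward.
    text = ''.join(text.partition('rescued')[1:])
    # Split on the last 'Finished' and drop the '\r' just before it.
    head, sep, tail = text.rpartition('Finished')
    return head[:-1] + sep + tail

def by_regex(text):
    match = re.search(r'(rescued.*?)\r(Finished.*)', text, re.DOTALL)
    # Leave the text untouched if the expected markers are missing.
    return ''.join(match.groups()) if match else text

print(by_partition(sample) == by_regex(sample))  # True
print(repr(by_regex(sample)))
```

Both variants rely on the literal markers 'rescued' and 'Finished' being present, so they only suit this one narrowly defined input.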

wenzul · Answer 3 · 2014-10-14T00:58:10.827

-1

You have to grab the filter result in a variable.

Anyway I would use a simple RegEx approach.

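import os
import re

# `input` is the path argument from the question's main(input).
# On Python 3, pass newline='' to open() so '\r' is not translated to '\n'.
with open(os.path.abspath(input), 'r') as f:
    # grab everything from 'rescued:' through the final 'Finished'
    match = re.search(r"rescued:.*Finished", f.read(), re.DOTALL)
if match:
    # cat -v displays a carriage return as ^M; in Python it is '\r'
    data = match.group(0).replace("\r", "")
    with open('output', 'w') as out:
        out.write(data)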

edited Oct 14 '14 at 00:58

answered Oct 13 '14 at 23:48

wenzul

This is the same as Warren's deleted answer, and it doesn't answer the question. – abarnert Oct 13 '14 at 23:49
Ok. Life would be easier if there were a binary file so that we don't have to painfully create this file on our own. – wenzul Oct 14 '14 at 00:10
I'm not sure how parsing text out of some random binary file would be easier than parsing it out of text plus control characters. Then there could be _anything_ between the strings we want, and we'd only have heuristics to try to guess at things (which would probably work as well as the `strings` program, at best…). – abarnert Oct 14 '14 at 00:13
With binary file I meant text plus unescaped control characters like he will process the data. I don't know if the input is escaped or unescaped right now. – wenzul Oct 14 '14 at 00:20
I'm pretty sure that's exactly what he has. That's why he needs the `-v` flag in the first place; if they were escaped in some way, they'd all be printable and visible without it. – abarnert Oct 14 '14 at 00:33
@p014k could you look at my answer and vote it up again if it is satisfactory? – wenzul Oct 14 '14 at 02:35
