Python: how to get rid of non-ascii characters being read from a file

Question

I am processing, with Python, a long list of data that looks like this:

29/07/2016 04:00:12 0.125143

The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved on this site.)

Now, when I read such a file into a script using something like open and readlines, I get an error:

SyntaxError: EOL while scanning string literal

I know (or can look up) how to use replace and regex functions, but I cannot apply them in my script. The biggest problem is that wherever I include or read such a strange character, an error occurs, pointing at the very line where it is read. So I cannot do anything with them.

These might help you: https://stackoverflow.com/questions/64749/m-character-at-end-of-lines and https://stackoverflow.com/questions/16695950/how-to-read-windows-file-in-linux-environment – Equinox, Jul 12 '17 at 08:31

score 1 · Answer 1 · answered Jul 12 '17 at 08:03

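Are you reading a file? If so, try to extract the values using regexps instead of removing the extra characters:

import re
# `line` is one line of the file, e.g. "29/07/2016 04:00:12 0.125143"
re.search(r'^([\d/: ]{19})', line).group(1)  # date and time: the first 19 characters
re.search(r'(\d+\.\d+)', line).group(1)      # the decimal value, whatever its length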
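As a rough sketch of how this extraction behaves when stray characters sit between the fields, the snippet below wraps two searches in a helper. The name `parse_line` and the sample line with its stray `Â` and no-break-space characters are assumptions made for illustration; the question does not show which characters actually appear in the file.

```python
import re

def parse_line(line):
    """Return (timestamp, value) taken from one line, or None if either is missing."""
    # Date and time are matched separately so that any non-digit junk may sit between them.
    stamp = re.search(r'(\d{2}/\d{2}/\d{4})\D+(\d{2}:\d{2}:\d{2})', line)
    # First number in the line that contains a decimal point.
    value = re.search(r'(\d+\.\d+)', line)
    if stamp is None or value is None:
        return None
    return stamp.group(1) + " " + stamp.group(2), value.group(1)

# Hypothetical line: the fields are separated by stray non-ASCII characters.
sample = "29/07/2016\u00c2\u00a004:00:12\u00c2\u00a00.125143\r\n"
print(parse_line(sample))
```

Because only the wanted fields are matched, the unwanted characters never have to be named or removed.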

bakatrouble

Thank you for giving more information, but sorry I don't have time to thoroughly test this (but I have upvoted you). – Violapterin Jul 28 '17 at 16:06

score 0 · Accepted Answer · answered Jul 28 '17 at 16:05

I found that re.findall works. (I am sorry that I did not have time to test the other methods: the job no longer matters, and I had even forgotten about this question.)

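import re

def extract_numbers(str_i):
    # Raw string, so that \d and \D reach the regex engine unchanged.
    pat = r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
    match_h = re.findall(pat, str_i)
    # findall returns a list of tuples, one per match; take the first.
    return match_h[0]

# ....
# `f` is the handle of the file in question
lines = f.readlines()
for line in lines:
    ls_f = extract_numbers(line)
    # process them....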
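The snippet above leaves open how `f` is opened. A minimal sketch of one way to do it, assuming the file is mostly ASCII with a few stray bytes, is to decode with an explicit encoding and an error handler so that reading never raises. The function name `read_records` is hypothetical; `encoding` and `errors` are standard parameters of the built-in `open`.

```python
import re

PATTERN = re.compile(r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)")

def read_records(path):
    """Yield one tuple of digit strings for every line that matches PATTERN."""
    # errors="replace" turns each byte that is not valid ASCII into U+FFFD
    # instead of raising UnicodeDecodeError.
    with open(path, encoding="ascii", errors="replace") as f:
        for line in f:
            match = PATTERN.search(line)
            if match:  # lines without a full record are skipped
                yield match.groups()
```

`errors="replace"` is chosen over `errors="ignore"` on purpose: if the stray bytes are the only thing separating the date from the time, dropping them would fuse `2016` and `04` into one run of digits, and the pattern could then split the fields in the wrong place.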
