I am writing a script that parses an Apache log file into a pandas table. I noticed that the text in the log file sometimes contains UTF-8 values like this:

Bern,%20Bahnhof

For example, one line from the log file:

IP - - timestamp "GET /v1/connections?from=Bern,%20Bahnhof&to=Luzern HTTP/1.1" httpstatus bytes -"

My current code to open the log files:

import pandas as pd

cols = ['ip','l','userid','timestamp','tz','fullrequest','status','bytes','referer','useragent']

# Read the whitespace-delimited log, skip malformed lines, then drop the unused columns.
df = pd.read_csv(path + file, delim_whitespace=True, names=cols, error_bad_lines=False, encoding='utf8')
df = df.drop(['l', 'userid'], axis=1)

Is there a way to parse the log files into pandas so that these strange characters are converted into Latin characters?

So that in the end we have something like this:

Bern, Bahnhof
timo
  • It's called "URL encoding"; you can look it up easily. – Matteo Italia Dec 03 '17 at 13:02
  • Regular 7-bit ASCII and UTF-8 are identical in the range 0x00-0x7F so it's not *wrong* to say that `%20` is UTF-8, but it's not particularly that; it's also the same character (space) in a large number of other encodings, including Latin-1 and KOI-8R. "Real" percent-encoded UTF-8 would look like `%C3%A4` which corresponds to the Unicode character *ä* ([U+00E4](https://www.fileformat.info/info/unicode/char/00e4/index.htm)) – tripleee Dec 03 '17 at 13:17
  • Unicode and Latin-1 are identical in the range 0xA0-0xFF so no actual conversion is necessary really if that's actually what you want. But I expect you really are trying to find out how to decode percent-encoding. – tripleee Dec 03 '17 at 13:20
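
As the comments point out, `%20` is percent-encoding (URL encoding). A minimal sketch of decoding it with the standard library's `urllib.parse.unquote`, assuming Python 3; the `request` string is copied from the example log line in the question and the variable names are only illustrative:

```python
from urllib.parse import unquote

# Request string taken from the example log line in the question.
request = "GET /v1/connections?from=Bern,%20Bahnhof&to=Luzern HTTP/1.1"

# unquote() replaces each %XX escape with the byte it encodes and
# decodes the resulting byte sequence as UTF-8 by default.
decoded = unquote(request)
print(decoded)

# A multi-byte UTF-8 sequence such as %C3%A4 becomes a single character.
umlaut = unquote("%C3%A4")
print(umlaut)
```

Applied to the DataFrame from the question, something like `df['fullrequest'] = df['fullrequest'].map(unquote)` should decode the whole column, assuming every value in that column is a string. If the logs also use `+` for spaces in query strings, `urllib.parse.unquote_plus` handles that case as well.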
