How to recognize special eol character when I see it, using Python?

Question

I'm scraping a set of files that were originally PDFs, using Python. Having converted them to text, I had a lot of trouble stripping out the line endings, because I couldn't figure out what the line separator was. The trouble is, I still don't know.

It's not a '\n', or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by doing a call to my_str.replace(eol, ''), I can remove all of these characters from one of my files.

So my question is open-ended. I'm a bit lost when it comes to Unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it in? Is there a way I can refer to it as a code, perhaps? I can't get Python to yield what it actually IS. All I ever see when I print it, or call unicode(special_eol), is the character in its functional usage as a newline.

Please help! Thanks, and sorry if I'm missing something obvious.

`print repr(eol)` and / or `print eol.encode('unicode_escape')` and / or `print ord(eol)`, then show us the output. — Martijn Pieters, Sep 25 '13 at 06:39
**All** Unicode codepoints can be represented by a Unicode escape sequence, but without more detail we cannot tell you what escape code is the right one. – Martijn Pieters, Sep 25 '13 at 06:45
Wow, you just knew exactly what I needed to type. Um, I'm a bit embarrassed, as it is a '\n'. I don't understand, because I tried `str.rstrip('\n')`. — Brian Peterson, Sep 25 '13 at 06:45
`.rstrip('\n')` only removes the newline from the end of a string; perhaps it was present elsewhere in the string as well? Take a look at [`str.splitlines()`](http://docs.python.org/2/library/stdtypes.html#str.splitlines) as well. — Martijn Pieters, Sep 25 '13 at 06:48
Well, this solves my confusion anyway. Since I do want to remove all of them, `.replace('\n', '')` is sufficient. Thanks a lot! — Brian Peterson, Sep 25 '13 at 06:51
For the record, `repr()` and `.encode('unicode_escape')` were exactly what I was looking for. – Brian Peterson, Sep 25 '13 at 06:53
I always leave the caveat at the end of my questions, 'sorry if I'm missing something obvious', because I often am. — Brian Peterson, Sep 25 '13 at 07:04
NP, if that is what helped, I turned that into an answer. :-) — Martijn Pieters, Sep 25 '13 at 07:05
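
To illustrate the difference between `.rstrip('\n')`, `.replace('\n', '')` and `str.splitlines()` discussed in the comments above, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2, where byte strings split on fewer characters), and the sample string is made up for illustration.

```python
# Hypothetical sample: a newline in the middle, a form feed, and a trailing newline.
text = "first\nsecond\x0cthird\n"

# rstrip only trims characters at the END of the string;
# the newline between "first" and "second" is left alone.
trimmed = text.rstrip("\n")         # 'first\nsecond\x0cthird'

# replace removes every occurrence, wherever it appears.
flattened = text.replace("\n", "")  # 'firstsecond\x0cthird'

# In Python 3, str.splitlines() treats several characters as line
# boundaries, including '\n', '\r', '\r\n' and the form feed '\x0c'.
lines = text.splitlines()           # ['first', 'second', 'third']

print(repr(trimmed))
print(repr(flattened))
print(lines)
```

If the goal is to drop every kind of line boundary rather than just `'\n'`, joining the result of `splitlines()` is one option, though it also removes characters such as the form feed.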

score 2 · Accepted Answer · answered Sep 25 '13 at 07:04

To determine what specific character that is, you can use `str.encode('unicode_escape')` or `repr()` to get (in Python 2) an ASCII-printable representation of the character:

>>> print u'☃'.encode('unicode_escape')
\u2603
>>> print repr(u'☃')
u'\u2603'
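
For readers on Python 3, where `print` is a function and `.encode('unicode_escape')` returns `bytes`, a rough equivalent might look like the sketch below. The helper name `describe_chars` and the sample string are hypothetical and not part of the original answer; `ascii()`, `ord()` and `unicodedata.name()` are standard-library calls.

```python
import unicodedata


def describe_chars(text):
    """Return (escaped form, code point, Unicode name) for each character."""
    rows = []
    for ch in text:
        # Control characters such as '\n' have no Unicode name, so pass a
        # fallback value instead of letting unicodedata.name() raise ValueError.
        name = unicodedata.name(ch, "<no name>")
        rows.append((ascii(ch), "U+%04X" % ord(ch), name))
    return rows


# Hypothetical sample: newline, form feed, LINE SEPARATOR, snowman.
sample = "\n\x0c\u2028\u2603"
for row in describe_chars(sample):
    print(row)

# The Python 3 counterpart of the unicode_escape call above yields bytes.
escaped = "\u2603".encode("unicode_escape")
print(escaped)
```

Printing the code point and name alongside the escaped form makes it easier to tell look-alike separators apart, for example `'\n'` (U+000A) versus U+2028 LINE SEPARATOR.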

answered Sep 25 '13 at 07:04

Martijn Pieters

I used this again later with a different weird character I needed to pull out, which had a nice '\x0c' utf-8 representation. – Brian Peterson Sep 25 '13 at 08:36
