I'm using Python to scrape a set of files that were originally PDFs. After converting them to text, I had a lot of trouble stripping out the line endings, because I couldn't figure out what the line separator was. The trouble is, I still don't know.
It's not '\n' or, I don't think, '\r\n'. However, I've managed to isolate one of these special characters. I literally have it in memory, and by calling my_str.replace(eol, ''), I can remove all of these characters from one of my files.
So my question is open-ended. I'm a bit lost when it comes to Unicode and such. How can I identify this character in my files without resorting to something ridiculous, like serializing it and then reading it back in? Is there a way I can refer to it by a code, perhaps? I can't get Python to tell me what it actually IS. All I ever see if I print it, or call unicode(special_eol), is the character in its functional usage as a newline.
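To make concrete what kind of answer would help, here is a minimal sketch of inspecting a single character by its code rather than printing it. It assumes Python 3; on Python 2, repr() plays the role of ascii(). The helper name describe_char is hypothetical, and U+2028 (LINE SEPARATOR) is only a stand-in, since the actual character in my files is still unknown.

```python
import unicodedata


def describe_char(ch):
    """Return (escaped form, code point, Unicode name) for one character.

    describe_char is a hypothetical helper name used for illustration only.
    """
    escaped = ascii(ch)                          # escaped literal instead of the raw glyph
    code_point = "U+{:04X}".format(ord(ch))      # numeric code point in hex
    name = unicodedata.name(ch, "<no name>")     # control characters have no name
    return escaped, code_point, name


# Stand-in for the mystery character; the real one may well be something else.
special_eol = "\u2028"
print(describe_char(special_eol))  # ("'\\u2028'", 'U+2028', 'LINE SEPARATOR')

# Once the code point is known, the character can be written as an escape.
sample = "first line\u2028second line"
print(sample.replace("\u2028", ""))  # first linesecond line
```

Calling describe_char on the character held in memory should show which escape sequence to use in replace(), so the character no longer has to be carried around as an opaque value.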
Please help! Thanks, and sorry if I'm missing something obvious.