I'm stuck and I feel stupid.

I've got a database of Tweets which I'm exporting to a .CSV file using .NET. I'd like to analyze this data in Python with Pandas and NLTK. However, I'm totally stuck on the first step: reading the CSV in Python. This led to this soup of problems: Python open CSV file with supposedly mixed encodings?

Surely it can't be that hard to open a file and print the text when I'm the one creating the text file?

I'm using the following C# code to generate the CSV file (supposedly as UTF-8):

using (FileStream fs = new FileStream(fullFileName, FileMode.Append, FileAccess.Write))
using (StreamWriter sw = new StreamWriter(fs, Encoding.UTF8))

According to chardet the encoding is: ISO-8859-2.

A little hint in the right direction would be greatly appreciated...

Ropstah
  • You may want to read this while you wait for an answer: http://stackoverflow.com/questions/191359/how-to-convert-a-file-to-utf-8-in-python – Day Davis Waterbury Feb 18 '15 at 21:06
  • I appreciate your comment, however I already tried some encoding/decoding steps but they all produced unwanted results. I'm asking this question to be able to avoid these steps and just open the textfile as is... – Ropstah Feb 18 '15 at 21:08
  • The link you posted also involves creating an entirely new file. I want to use the file I deliver... – Ropstah Feb 18 '15 at 21:17
  • Ok, I managed to transcode the file from `ISO-8859-2` to `UTF8`. However, it then breaks again on some other character... I then tried `ISO-8859-1` as the source encoding and that seems to work! But how the h*ll am I supposed to know this without trial and error? – Ropstah Feb 18 '15 at 21:27
  • And now I'm able to print the CSV to screen, but now Pandas can't read the file due to incorrect encoding... aarggg – Ropstah Feb 18 '15 at 21:30
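
On the "trial and error" point in the comments above, here is a minimal sketch, assuming Python 3, of why guessing among single-byte encodings is unreliable. The helper name `try_encodings` and the candidate list are hypothetical, chosen only for illustration. It strictly decodes the same bytes with each candidate and returns the ones that raise no error.

```python
def try_encodings(raw, candidates=("utf-8", "iso-8859-2", "cp1252", "iso-8859-1")):
    """Return the candidate encodings that decode `raw` without an error.

    `raw` is the file content as bytes, e.g. open(path, "rb").read().
    The candidate list is an assumption for this sketch, not a complete set.
    """
    clean = []
    for name in candidates:
        try:
            raw.decode(name)  # strict mode: raises on any invalid byte
        except UnicodeDecodeError:
            continue
        clean.append(name)
    return clean


# Genuine UTF-8 bytes decode "successfully" under every candidate here,
# although the single-byte results would be mojibake.
utf8_sample = "café".encode("utf-8")
all_pass = try_encodings(utf8_sample)

# A lone 0xE9 byte is not valid UTF-8, so only the single-byte encodings pass.
legacy_sample = b"caf\xe9"
legacy_pass = try_encodings(legacy_sample)
```

ISO-8859-1 maps every possible byte value to a character, so it never raises a decode error; "it works" with ISO-8859-1 therefore says nothing about whether it is the right encoding. A strict UTF-8 decode is the more informative check: if it succeeds on text containing non-ASCII characters, the file is very likely UTF-8.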

1 Answer

If the encoding is ISO-8859-2, try telling Python to open it with that, e.g. `open('filename', encoding='iso-8859-2')`.
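
A minimal sketch of that idea, assuming Python 3 and using only the standard library. The helper `read_rows`, the file name `tweets.csv` and the sample bytes are hypothetical. The default of `utf-8-sig` is an assumption based on the C# code in the question: a `StreamWriter` created with `Encoding.UTF8` normally writes a byte order mark at the start of a new file, and `utf-8-sig` strips that mark if present and otherwise behaves like `utf-8`. Pass `encoding='iso-8859-2'` instead if that turns out to be the real encoding.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8-sig"):
    """Read a CSV file into a list of rows using an explicit encoding."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding=encoding, newline="") as handle:
        return list(csv.reader(handle))


# Demo input: BOM + UTF-8 text with Windows line endings, which is what
# the C# writer is assumed to produce for a freshly created file.
demo_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
with open(demo_path, "wb") as handle:
    handle.write(b"\xef\xbb\xbfid,text\r\n1,caf\xc3\xa9\r\n")

rows = read_rows(demo_path)
```

`pandas.read_csv` accepts an `encoding` argument as well, so the same encoding name can be passed there once it is known.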

Tom Hunt
  • I think the encoding is `UTF-8`, and so does Notepad++. I only mentioned that `chardet` says it's `iso-8859-2` because I thought it might be relevant... – Ropstah Feb 18 '15 at 21:06
  • Did you **try** ISO-8859-2? Why include that information in your question if you're not going to use it? – MattDMo Feb 18 '15 at 21:07
  • Yes I tried it, along with some other encodings like `WINDOWS-1252` and `Unicode` (the latter didn't exist btw) – Ropstah Feb 18 '15 at 21:09
  • In any event, it seems the problem is with the input file, not with Python. It may be more productive to inquire into the C# code that produces it. – Tom Hunt Feb 18 '15 at 21:10
  • I posted the relevant C# code. It's just writing it as UTF8 and also Notepad++ identifies (and displays) the file as UTF8. – Ropstah Feb 18 '15 at 21:11
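
One way to inspect the input file itself, as suggested in the comments above, is to find the first byte sequence that is not valid UTF-8 and look at the bytes around it. This is a sketch under the assumption that Python 3 is used; the helper name `first_invalid_utf8` and the `context` parameter are hypothetical.

```python
def first_invalid_utf8(raw, context=8):
    """Locate the first invalid UTF-8 sequence in `raw` (bytes).

    Returns None if `raw` is valid UTF-8, otherwise a tuple of
    (offset of the offending byte, surrounding bytes for inspection).
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        start = max(exc.start - context, 0)
        return exc.start, raw[start:exc.end + context]
    return None


# 0x96 is a bare continuation byte, so it cannot start a UTF-8 sequence.
# In Windows-1252 the same byte is an en dash.
problem = first_invalid_utf8(b"abc\x96def")
clean = first_invalid_utf8("plain ascii".encode("utf-8"))
```

If this returns `None` for the whole file, the bytes are valid UTF-8 and the problem is more likely in how the file is being read. If it returns an offset, the bytes at that position show which rows in the export were written in some other encoding.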