I downloaded a dataset from Kaggle and am trying to execute the following code:

import pandas as pd
movie_data = pd.read_csv('moviemetadata.csv', encoding = 'utf-8', delimiter = ',', header=0, decimal = '.')
print(movie_data.info)

Curiously, when I'm trying to run it inside Sublime Text or the Terminal (I'm on a Mac), it won't work and the following error gets thrown out:

Traceback (most recent call last):
File ".../test.py", line 14, in <module>
print(movie_data.info) #UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 7356: ordinal not in range(128)
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 7559: ordinal not in range(128)

I googled this error message and tried to find a fix, for example by including the "encoding = 'utf-8'", but this didn't fix it. I then tried to run the same code in jupyter, and it works flawlessly. I get exactly the output I want.

Does somebody have an idea what causes this and how I could get the same code to work in the terminal also?

Additional info: I'm using the same Python version in terminal and jupyter, and I saved the .csv specifically with utf-8 encoding.

azureai
  • 1
    Are you using the same Python version in Jupyter and on the terminal? – eugenhu Nov 16 '17 at 14:00
  • Yep, probably 2.7 in the terminal (it comes with the Mac). I did a quick Google search and found `encoding = 'utf8'` without the dash. Did you try that? It should probably work with utf-8 though. – Anton vBR Nov 16 '17 at 14:04
  • `encoding = 'latin-1'` maybe.. – cs95 Nov 16 '17 at 14:05
  • @eugenhu Yes. 'import sys; sys.executable' gives the same result in both jupyter and terminal. – azureai Nov 16 '17 at 14:08
  • @AntonvBR I tried, same error message. 'latin-1' didn't work either. – azureai Nov 16 '17 at 14:09
  • Does the code work on your terminal without `print(movie_data.info)`? – eugenhu Nov 16 '17 at 14:29
  • @eugenhu It does. – azureai Nov 16 '17 at 14:37
  • The issue might be with printing to the terminal, and the csv could be correctly parsed. You could try using `print(movie_data.info.encode('utf-8', 'ignore'))`. – eugenhu Nov 16 '17 at 14:39
  • @eugenhu This gives me an 'AttributeError: 'function' object has no attribute 'encode'' error message. – azureai Nov 16 '17 at 14:40
  • Oh ok right, is `movie_data.info` a Series? If so, you can try manually printing out each row by iterating through them, and use `.encode('utf-8', 'ignore')` when printing out text. – eugenhu Nov 16 '17 at 14:42
  • Or actually just try `print(movie_data.info.apply(lambda x: x.encode('utf-8', 'ignore')))`. – eugenhu Nov 16 '17 at 14:48
  • Is that a SyntaxError? – eugenhu Nov 17 '17 at 15:09
  • @eugenhu Sorry. So I copied 'print(movie_data.info.apply(lambda x: x.encode('utf-8', 'ignore')))' verbatim this time and get "'function' object has no attribute 'apply'". It works if I write it without the '.apply', but the output is incomplete and only a fraction of that in jupyter. – azureai Nov 17 '17 at 17:24
  • @eugenhu Ok, this seems to be another issue which is separate from my original problem. So regarding my original question: The terminal can't handle whatever encoding the file originally had and giving an explicit encoding encodes the data as utf-8 so that it works? – azureai Nov 17 '17 at 17:32
  • 1
    Yes, you can see [this question](https://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python?rq=1) which has a similar problem. Also, do you have a column named 'info' in your dataframe? I've just realised that this refers to a method [`DataFrame.info()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html). If you want to refer to the column you will need to do `movie_data["info"]`, and `apply()` should work on it afterwards. – eugenhu Nov 18 '17 at 01:56
  • @eugenhu I think I see now what my problem was. I really want the output of '.info()', which gives me some meta-information about the dataframe. Instead, I wrote '.info', which shouldn't work since .info is a method. Terminal then correctly gives an error-message, because trying to print a method shouldn't work I think. Though why it says this is an encoding error is beyond me. When using your workaround with the lambda, I get the same output as when using 'print(movie_data.info())', without any encoding specified, which is the output I'm interested in. – azureai Nov 18 '17 at 09:13
  • And for some inexplicable reason, jupyter gives an output with the '.info' without parentheses (which is different from the .info() output), which got me thinking that there must be something wrong with the terminal output. – azureai Nov 18 '17 at 09:20
  • Yeah, Jupyter must have shown something like `<bound method DataFrame.info of ...>`. – eugenhu Nov 18 '17 at 10:28
  • @eugenhu That is exactly what it showed. Why does jupyter do this? – azureai Nov 18 '17 at 11:13
  • Essentially when you do `print(x)` on something that isn't a string, such as on a [`MethodType`](https://docs.python.org/3/library/types.html#types.MethodType) which is what all user-defined methods of class instances are, `x` is first converted to a string via [`str(x)`](https://docs.python.org/3.4/library/stdtypes.html#str) or [`repr(x)`](https://docs.python.org/3.4/library/functions.html#repr). What you saw as the output in Jupyter was just the way `MethodType` represents itself as a string. – eugenhu Nov 18 '17 at 11:32
  • The behaviour is the same on the terminal; you just didn't get to see it because of the error. – eugenhu Nov 18 '17 at 11:33
  • 1
    @eugenhu I see. If you care to put this as an answer I will accept it. Thank you very much for your time. – azureai Nov 20 '17 at 20:57

1 Answer

You just have to do:

movie_data.info()

Since info() is a method.

There's also no need to wrap it in a print call as info() already outputs to sys.stdout by default.


print(movie_data.info) worked in Jupyter but not in your terminal most likely because of an output-encoding issue. When you print a non-string object, print first converts it to a string with str(), which falls back to repr(). Since movie_data.info is a MethodType, i.e. a bound method, repr(movie_data.info) will look something like <bound method DataFrame.info of ...> where ... is the string representation of your dataframe. Because your dataframe contains non-ASCII characters, so does that string representation, and if the output stream cannot encode them (the 'ascii' codec in your traceback), writing it to stdout raises an encoding error.
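
A minimal sketch of the point above, using a hypothetical Table class as a stand-in for a DataFrame (Table, fits_ascii and the sample titles are illustrative assumptions, not pandas API). It shows that the string form of a bound method embeds the repr of the object it belongs to, so it inherits any non-ASCII characters, while the result of actually calling the method does not have to:

```python
class Table:
    """Hypothetical stand-in for a DataFrame: its repr shows its contents."""

    def __init__(self, titles):
        self.titles = titles

    def __repr__(self):
        return "Table({})".format(", ".join(self.titles))

    def info(self):
        # A summary that never includes the cell values themselves.
        return "{} rows".format(len(self.titles))


def fits_ascii(text):
    """Return True if text can be encoded with the 'ascii' codec."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Two titles containing non-ASCII letters (written as escapes).
table = Table(["Am\u00e9lie", "\u00c5dalen 31"])

method_text = str(table.info)  # no parentheses: the bound method itself
summary_text = table.info()    # parentheses: the method is actually called

print(method_text.startswith("<bound method Table.info of Table("))  # True
print(fits_ascii(method_text))   # False
print(fits_ascii(summary_text))  # True
```

Only booleans are printed here, so the sketch itself runs even on an ASCII-only terminal.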

The summary printed by info() doesn't appear to include any cell or index values, only column names. So unless your column names also contain non-ASCII characters, movie_data.info() on its own is enough; otherwise, encoding the column names first, as below, should also work:

movie_data.columns = map(lambda s: s.encode('utf-8', 'ignore'), movie_data.columns)
movie_data.info()
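
To see why the same print call can succeed in one environment and fail in another, here is a rough sketch that simulates two output streams with different encodings. write_like_terminal is a hypothetical helper, and the in-memory streams only approximate a real terminal; sys.stdout.encoding shows what your actual session uses.

```python
import io
import sys


def write_like_terminal(text, encoding):
    """Print text to an in-memory stream that encodes with `encoding`.

    Returns the bytes that were written, or raises UnicodeEncodeError
    if the codec cannot represent some character in text.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


text = "Title: Am\u00e9lie"  # one non-ASCII letter, written as an escape

utf8_bytes = write_like_terminal(text, "utf-8")  # succeeds

try:
    write_like_terminal(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True  # same kind of error as in the question's traceback

print(ascii_failed)
print(sys.stdout.encoding)  # the encoding your own session writes with
```

If sys.stdout.encoding reports an ASCII-only codec in the terminal, setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is one documented way to change it. Whether that is the cause here is an assumption based on the traceback.
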
eugenhu
  • 1
    As I've written in the comments, 'movie_data.info()' is actually what I wanted. This is not to take from your explanation which is very informative, it was just that since 'movie_data.info' gave an output in jupyter I assumed there must be something wrong with my terminal. Your explanation cleared it all up for me. – azureai Nov 21 '17 at 19:42
  • @azureai Oh yeah that's right, my bad must have skimmed over that, I've edited my answer anyway for completeness sake. It's a little bit different but basically you can just have `movie_data.info()` without putting it in a `print` if you're not already doing that. – eugenhu Nov 21 '17 at 21:24