
I'm a Python beginner, so I'm practicing some data analysis with pandas on a dataframe that lists restaurants with a Michelin star (restaurants_df).

When I show, for example, the first 5 rows I notice that in the "price" column (object type) of row 4 I have a blank value:

In [ ]: restaurants_df.head()
Out[ ]:
    name            year    latitude    longitude     city        region    zipCode   cuisine           price
0   Kilian Stuba    2019    47.348580   10.17114   Kleinwalsertal Austria    87568  Creative              $
1   Pfefferschiff   2019    47.837870   13.07917    Hallwang      Austria    5300   Classic cuisine       $
2   Esszimmer       2019    47.806850   13.03409    Salzburg      Austria    5020   Creative              $
3   Carpe Diem      2019    47.800010   13.04006    Salzburg      Austria    5020   Market cuisine        $
4   Edvard          2019    48.216503   16.36852    Wien          Austria    1010   Modern cuisine  

Then I check how many NaN values are in each column. The price column has 151 of them:

In [ ]: restaurants_df.isnull().sum()
Out[ ]: name           0
        year           0
        latitude       0
        longitude      0
        city           2
        region         0
        zipCode      149
        cuisine        0
        price        151
        dtype: int64

Afterwards, I replace those values with the string "No Price" and confirm that all of them have been replaced.

In [ ]: restaurants_df["price"].fillna("No Price", inplace = True)
        restaurants_df.isnull().sum()
Out[ ]: name           0
        year           0
        latitude       0
        longitude      0
        city           0
        region         0
        zipCode        0
        cuisine        0
        price          0
        dtype: int64

However, when I show the first 5 rows, the problem persists.

In [ ]: restaurants_df.head()
Out[ ]:
    name            year    latitude    longitude     city        region    zipCode   cuisine           price
0   Kilian Stuba    2019    47.348580   10.17114   Kleinwalsertal Austria    87568  Creative              $
1   Pfefferschiff   2019    47.837870   13.07917    Hallwang      Austria    5300   Classic cuisine       $
2   Esszimmer       2019    47.806850   13.03409    Salzburg      Austria    5020   Creative              $
3   Carpe Diem      2019    47.800010   13.04006    Salzburg      Austria    5020   Market cuisine        $
4   Edvard          2019    48.216503   16.36852    Wien          Austria    1010   Modern cuisine  

Any idea why this is happening and how I can solve it? Thanks in advance!

  • Please provide the code in your question, not a screenshot. This makes it easier for answerers to reproduce your problem instead of typing it out manually. – dboy May 01 '20 at 11:47
  • Sorry for the inconvenience, I'm new here and I'm trying to figure out the best way to ask the questions. Thanks for the tip, I'll edit my post then! – ricardoper9 May 01 '20 at 12:42
  • Please provide a [mcve]. Please do not share information as images unless absolutely necessary. See: https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors, https://idownvotedbecau.se/imageofcode, https://idownvotedbecau.se/imageofanexception/. – AMC May 01 '20 at 12:52

3 Answers


What I understand is that you are dealing with both blank values and null values. These are handled differently. Check out this question to understand how to handle them.
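A minimal sketch of that distinction, using a made-up four-row frame (the data and the name `demo_df` are hypothetical, not taken from the question's dataset). `isnull()` only counts real missing values, so blank or whitespace-only strings have to be converted to NaN before `fillna` can reach them:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real price, two "blank" strings, one true missing value
demo_df = pd.DataFrame({"price": ["$$", "", "   ", None]})

# Only the None entry counts as missing here
missing_before = int(demo_df["price"].isnull().sum())

# Turn empty or whitespace-only strings into NaN
cleaned = demo_df["price"].replace(r"^\s*$", np.nan, regex=True)
missing_after = int(cleaned.isnull().sum())

# Now fillna reaches the former blanks as well
filled = cleaned.fillna("No Price")

print(missing_before, missing_after)
print(filled.tolist())
```

If `missing_after` is larger than `missing_before` on the real data, the column contained blank strings in addition to NaN values.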

Moosa Saadat

Viewing the dataset over at Kaggle shows that the first four restaurants have five '$' while the fifth has four '$'. So my guess is that the Jupyter notebook is just not rendering all the '$' characters, while the underlying data is correct.

To double-check whether I'm correct, try running

df.price

and see what you get. I think this has something to do with Jupyter's HTML rendering when it tries to display four dollar signs, since dollar signs can be interpreted as math delimiters. You can look at this issue, which is similar to yours.
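As a sketch of that check, the following uses a small hypothetical frame (`demo_df` and its values are illustrative assumptions) and inspects the stored strings directly, so the result does not depend on how the notebook renders HTML tables:

```python
import pandas as pd

# Hypothetical sample standing in for the real price column
demo_df = pd.DataFrame({"price": ["$$$$$", "$$$$", "$$$$$"]})

# repr() shows each stored string exactly, quotes included
raw_values = [repr(v) for v in demo_df["price"]]

# The length of each string tells how many '$' are really stored
lengths = demo_df["price"].str.len().tolist()

print(raw_values)
print(lengths)
print(demo_df["price"].value_counts())
```

If the lengths are all between 1 and 5, the data is intact and only the display is affected.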

If you're bothered by this, simply replace the '$' symbols with a number using something like

df.replace({'price': {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4, '$$$$$': 5}})
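Note that `df.replace` returns a new dataframe, so the result has to be assigned back. An alternative sketch, assuming every non-missing price is just a run of '$' characters: count the characters instead of listing each case. The frame `demo_df` and the column name `price_level` are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one missing price
demo_df = pd.DataFrame({"price": ["$", "$$$", "$$$$$", None]})

# str.len() counts the '$' characters; missing prices stay missing.
# The nullable Int64 dtype keeps whole numbers alongside <NA>.
demo_df["price_level"] = demo_df["price"].str.len().astype("Int64")

print(demo_df)
```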
A Kareem
  • You're right! I just double checked and, in fact, Jupyter isn't showing the full content of the `price` column. I already had in mind to convert each set of symbols to a numerical scale, so I'll proceed with that analysis then. Thanks for your help! – ricardoper9 May 01 '20 at 15:34
  • Hi @ricardoper9 if this or any answer has solved your question please consider [accepting it](https://meta.stackexchange.com/q/5234/179419) by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. – A Kareem May 01 '20 at 15:46
  • Hey, thank you once again for the tip. Sure, my problem is solved so I'll accept it! – ricardoper9 May 01 '20 at 16:02

I don't think pandas will recognize cells containing '' as null. For instance:

import numpy as np
import pandas as pd
df2 = pd.DataFrame(np.array([[1, 2, ''], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

then:

df2.isnull()

       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False

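See here. You could also try the option below, though it only makes pandas treat inf and -inf as missing, not empty strings: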

pandas.options.mode.use_inf_as_na = True

EDIT:

You could also try replacing the empty strings with:

df2.replace({'': 'No Price'}, inplace=True)

EDIT2: I believe @AKareem has the solution, but to expand on it, you can use this to escape the LaTeX:

# raw strings (r'...') keep the backslash literal and avoid invalid-escape warnings
restaurants_df.replace({'price': {
                       '$': r'\$',
                       '$$': r'\$$',
                       '$$$': r'\$$$',
                       '$$$$': r'\$$$$',
                       '$$$$$': r'\$$$$$'}},
                       inplace=True)
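Two variations on the same idea, sketched on a hypothetical frame (`demo_df` is an illustrative assumption). The first relies on the pandas display option `display.html.use_mathjax`, which in recent pandas versions controls whether notebook tables are processed by MathJax. The second puts a backslash in front of every dollar sign rather than only the first one, which changes the stored strings and is therefore better kept in a separate variable for display only.

```python
import pandas as pd

# Hypothetical sample standing in for the real price column
demo_df = pd.DataFrame({"price": ["$", "$$$$", "$$$$$"]})

# Option 1: ask pandas not to let MathJax process HTML tables,
# leaving the stored data untouched
pd.set_option("display.html.use_mathjax", False)

# Option 2: escape every '$' with a backslash (literal match, not a regex)
escaped = demo_df["price"].str.replace("$", r"\$", regex=False)

print(escaped.tolist())
```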
jayveesea