
I have a simple pandas DataFrame, let's call it ratings:

ratings = pd.read_csv("ratings.csv", header=0, delimiter=",")
print(ratings)
userId  movieId  rating
    1        1     4.0
    1        3     4.5
    2        6     4.0
    3        2     5.5
    3       11     3.5
    3       32     3.0
    4        4     4.0
    5       26     4.5

I'm trying to get the number of distinct values in a column, and I found this solution:

Count distinct values, use nunique:

df['hID'].nunique()

Count only non-null values, use count:

df['hID'].count()

Count total values including null values, use the size attribute:

df['hID'].size
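To make the difference between the three calls concrete, here is a minimal sketch. The DataFrame and its values are made up for illustration (only the column name hID comes from the quoted solution); it contains one repeated value and one missing value so that the three counts differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one repeated value (102) and one missing value (NaN).
df = pd.DataFrame({"hID": [101, 102, 102, np.nan, 103]})

distinct = df["hID"].nunique()  # distinct values, NaN excluded by default
non_null = df["hID"].count()    # values that are not null
total = df["hID"].size          # all values, including null ones

print(distinct, non_null, total)  # 3 4 5
```

Note that nunique and count are methods and need parentheses, while size is an attribute.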

So I followed:

print("%s unique users" % ratings["userId"].nunique())

And get an output like this:

(5,) unique users

After reading pandas.DataFrame.nunique() doc, I checked its datatype:

print(type(ratings["userId"].nunique()))
<class 'tuple'>

Now I don't know how to use this value in another variable as a numeric value. If I wrap it inside int():

print(type(int(ratings["userId"].nunique())))

the output will still be <class 'tuple'>, and using that variable elsewhere in my code raises an error.

I'm quite new to Python, so this might be a silly question. Thanks for reading and helping me solve this!

Edit: here is my real code (posted here because comments don't support proper code formatting):

ratings = pd.read_csv(
    "../ml-latest-small/ratings.csv",
    header=0,
    delimiter=",",
    usecols=["userId", "movieId", "rating"]
)

numof_movies = ratings["movieId"].nunique()[0],
numof_users = ratings["userId"].nunique(),
numof_ratings = len(ratings)

print("\n%s movies, %s users and %s ratings are given\n" % (
    numof_movies,
    numof_users,
    type(numof_ratings)
))

And here is what the ratings.csv file looks like:

userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
...

And how the DataFrame looks when I print it to the terminal:

        userId  movieId  rating
0            1        1     4.0
1            1        3     4.0
2            1        6     4.0
3            1       47     5.0
4            1       50     5.0
...        ...      ...     ...
100831     610   166534     4.0
100832     610   168248     5.0
100833     610   168250     5.0
100834     610   168252     5.0
100835     610   170875     3.0
Tran Tam
  • Something seems really wrong. `Series.nunique` will return an `int` and `DataFrame.nunique` returns a Series, neither of which are tuples. Likely you accidentally overwrote some variable or method you didn't mean to so you're calling a method that doesn't do what you think or a method on a variable that isn't what it should be. – ALollz Dec 18 '20 at 20:09
  • @ALollz I'm not sure what I could have done wrong. I edited my question with my real code at the bottom; I hope you can take a look (long code is not formatted properly in comments). – Tran Tam Dec 18 '20 at 20:30
  • The fact that nunique returns a tuple is extremely odd. I believe your csv might have issues and pandas is casting something in order to construct the DataFrame. Do you have cells within the csv with double values, bad formatting, or something like that? – Yuca Dec 18 '20 at 20:43
  • @TranTam show us a screenshot of your DataFrame; this is the first time I've seen nunique return a tuple. – Yefet Dec 18 '20 at 20:49
  • Hi @Yefet and Yuca, I added the csv file structure and the print result of my DataFrame at the end of my post, thank you. – Tran Tam Dec 18 '20 at 20:56
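The return types mentioned in the first comment above can be checked directly. This is a minimal sketch that rebuilds the small ratings table from the top of the question by hand instead of reading ratings.csv, so it runs without the file.

```python
import pandas as pd

# The eight sample rows printed at the top of the question.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3, 3, 4, 5],
    "movieId": [1, 3, 6, 2, 11, 32, 4, 26],
    "rating": [4.0, 4.5, 4.0, 5.5, 3.5, 3.0, 4.0, 4.5],
})

users = ratings["userId"].nunique()  # Series.nunique -> a plain int
per_column = ratings.nunique()       # DataFrame.nunique -> a Series, one entry per column

print(type(users))
print(type(per_column))
```

Neither call produces a tuple, so a tuple has to come from how the result is assigned or used afterwards.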

2 Answers

unique_users = ratings["userId"].nunique()
print(f"{unique_users} unique users")
Sura-da

IIUC:

import pandas as pd
from io import StringIO

rating_txt = StringIO("""userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931""")

ratings_df = pd.read_csv(rating_txt)
ratings_df

print(f"{ratings_df['movieId'].nunique()} movies, {ratings_df['userId'].nunique()} user(s), and {ratings_df['rating'].count()} ratings are given.")

Output:

5 movies, 1 user(s), and 5 ratings are given.
Scott Boston
  • After trying out your code and comparing it to mine, I found that printing the values directly, without first assigning them to variables, worked; their types are also `<class 'int'>` as expected. I don't understand why, but thank you. – Tran Tam Dec 18 '20 at 21:22
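A likely explanation for why direct printing worked: in the real code from the question, the numof_movies and numof_users assignments each end with a trailing comma, and in Python a trailing comma after an expression builds a one-element tuple. The sketch below is pure Python and uses the value 5 as a stand-in for the result of ratings["userId"].nunique(); the name numof_users_fixed is made up for illustration.

```python
# Stand-in for ratings["userId"].nunique(), which returns a plain int.
nunique_result = 5

numof_users = nunique_result,       # trailing comma: this is the tuple (5,)
numof_users_fixed = nunique_result  # no trailing comma: this stays an int

print(type(numof_users))        # <class 'tuple'>
print(type(numof_users_fixed))  # <class 'int'>
print(f"{numof_users} users vs {numof_users_fixed} users")  # (5,) users vs 5 users
```

Under that reading, removing the trailing commas from the two assignments should be enough for the variables to hold ordinary integers.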