Unable to remove rows from dataframe based on condition

Question

So i have a dataframe, df:

        Rank                                              Name Platform  ...  JP_Sales Other_Sales Global_Sales     
0          1                                        Wii Sports      Wii  ...      3.77        8.46        82.74     
1          2                                 Super Mario Bros.      NES  ...      6.81        0.77        40.24     
2          3                                    Mario Kart Wii      Wii  ...      3.79        3.31        35.82     
3          4                                 Wii Sports Resort      Wii  ...      3.28        2.96        33.00     
4          5                          Pokemon Red/Pokemon Blue       GB  ...     10.22        1.00        31.37     
...      ...                                               ...      ...  ...       ...         ...          ...     
16593  16596                Woody Woodpecker in Crazy Castle 5      GBA  ...      0.00        0.00         0.01     
16594  16597                     Men in Black II: Alien Escape       GC  ...      0.00        0.00         0.01     
16595  16598  SCORE International Baja 1000: The Official Game      PS2  ...      0.00        0.00         0.01     
16596  16599                                        Know How 2       DS  ...      0.00        0.00         0.01     
16597  16600                                  Spirits & Spells      GBA  ...      0.00        0.00         0.01

I used df.describe and it shows that the year count is less than the others:

So i thought that some values in Year are empty. tried doing df.dropna() but that didnt work.

I then tried printing the values of the column Year which were not numbers with this code (Probably not the best code but it works) along with the type():

with open("vgsales.csv", "r") as csv_file:
    rows = csv_file.read().split("\n")
    row_components = [row.split(",") for row in rows if len(row) > 0]

    data_dict = {header:[] for header in row_components[0]}

    for header_index, header in enumerate(row_components[0]):
        print("header_index: ", header_index)
        for row_index, row in enumerate(row_components[1:]):
            data_dict[header].append(row[header_index])
    
    for i in data_dict["Year"]:
        if not i.isdigit():
            print(i, type(i))

The output (same output repeated a lot):

N/A <class 'str'>

So then i tried the answers i found in this stackoverflow question: df = df[df.Year != "N/A"] and it didnt work either

Also tried df = df.drop(df[(df.Year == "N/A")].index) and it didnt work

So then i thought Why dont i open it in excel and see what values are there when it is not a year. Indeed it was N/A

Any ideas what i can do? I want to clean the data so that all the columns have the same count for a machine learning project

score 2 · Accepted Answer · answered May 11 '21 at 22:42

2

First off, it's important to know why you're missing data, and to see if you can possibly impute rather than just drop.

If you still want to drop, you can use df = df.dropna(how='any').

The reason why Excel shows "N/A" as the value for missing data is because that's Excel's way of showing missing data. It doesn't mean that the value of the cell that is missing data is N/A--that would be a string containing an N, a slash, and an A. Instead, you can try df = df[~df['Year'].isnull()] as an alternative method for selecting non-null values.

answered May 11 '21 at 22:42

Yehuda

1,787
2
15
49

`df = df.dropna(how="any")` works Thank you. How is that different from just `df.dropna()` that makes this work but not that? – Muhd Mairaj May 11 '21 at 22:51
And you mention, i should see if i can impute, what would that be like? – Muhd Mairaj May 11 '21 at 22:53
1

@MuhdMairaj `df.dropna()` takes a position keyword argument called `how` this by default is set to `all` meaning that _all_ items in a row must be a `NaN` value for it to be dropped. You can also apply this to columns by changing the `axis` argument. using `any` means if any of the data is missing from a given row it will be dropped. – Umar.H May 11 '21 at 23:39
I see. Thank you once again. I found a source where it set `how="all"` explicitly, and it obviously didnt work from your comment – Muhd Mairaj May 12 '21 at 02:47
1

@MuhdMairaj Imputation involves intelligently filling in null values. Check out [this answer](https://stackoverflow.com/a/57443381/7287543) for a discussion about various techniques. – Yehuda May 12 '21 at 14:13

Unable to remove rows from dataframe based on condition

1 Answers1