So i have a dataframe, df:
Rank Name Platform ... JP_Sales Other_Sales Global_Sales
0 1 Wii Sports Wii ... 3.77 8.46 82.74
1 2 Super Mario Bros. NES ... 6.81 0.77 40.24
2 3 Mario Kart Wii Wii ... 3.79 3.31 35.82
3 4 Wii Sports Resort Wii ... 3.28 2.96 33.00
4 5 Pokemon Red/Pokemon Blue GB ... 10.22 1.00 31.37
... ... ... ... ... ... ... ...
16593 16596 Woody Woodpecker in Crazy Castle 5 GBA ... 0.00 0.00 0.01
16594 16597 Men in Black II: Alien Escape GC ... 0.00 0.00 0.01
16595 16598 SCORE International Baja 1000: The Official Game PS2 ... 0.00 0.00 0.01
16596 16599 Know How 2 DS ... 0.00 0.00 0.01
16597 16600 Spirits & Spells GBA ... 0.00 0.00 0.01
I used df.describe
and it shows that the year count is less than the others:
So i thought that some values in Year are empty.
tried doing df.dropna()
but that didnt work.
I then tried printing the values of the column Year which were not numbers with this code (Probably not the best code but it works) along with the type()
:
with open("vgsales.csv", "r") as csv_file:
rows = csv_file.read().split("\n")
row_components = [row.split(",") for row in rows if len(row) > 0]
data_dict = {header:[] for header in row_components[0]}
for header_index, header in enumerate(row_components[0]):
print("header_index: ", header_index)
for row_index, row in enumerate(row_components[1:]):
data_dict[header].append(row[header_index])
for i in data_dict["Year"]:
if not i.isdigit():
print(i, type(i))
The output (same output repeated a lot):
N/A <class 'str'>
So then i tried the answers i found in this stackoverflow question:
df = df[df.Year != "N/A"]
and it didnt work either
Also tried df = df.drop(df[(df.Year == "N/A")].index)
and it didnt work
So then i thought Why dont i open it in excel and see what values are there when it is not a year. Indeed it was N/A
Any ideas what i can do? I want to clean the data so that all the columns have the same count for a machine learning project