I am working on the IPL dataset from Kaggle (https://www.kaggle.com/manasgarg/ipl). It has two .csv files linked by a primary key. I want to drop the rows where the batting team lost the match. df_deliv has the batting team, and df_match has the winner of the match.

I achieved this with the code below, but it's very slow because of the for loop.

import pandas as pd
import numpy as np

df_deliv = pd.read_csv("deliveries.csv")
df_match = pd.read_csv("matches.csv")
df_deliv = df_deliv[["match_id", "batting_team", "batsman", "batsman_runs"]]
df_deliv["winner"] = [df_match.loc[i-1]["winner"] for i in df_deliv["match_id"]] #makes it very slow
df_deliv.drop(df_deliv[df_deliv["batting_team"] != df_deliv["winner"]].index, inplace = True)
print(df_deliv)

Is there a way to do this in one df.drop statement rather than with the for loop?

  • Please post a reproducible example. Why don't you join them and then just filter by the conditions you want instead of using drop? – Manrique Nov 21 '18 at 17:44
  • You could probably join the two dataframes using `merge()`. Please post `df_deliv.head()` and `df_match.head()` so we can see structure of dataframes and offer a more complete solution. – Swagga Ting Nov 21 '18 at 17:45
  • @AntonioManrique Sir, I am very new to asking questions and to data science. Please let me know what a reproducible example is. – Yash Mishra Nov 21 '18 at 18:29
  • @YashMishra Of course I can :) It basically means posting the code that allows us to reproduce your dataset and your error. Here you have a better explanation: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Manrique Nov 21 '18 at 19:03

1 Answer

Instead of dropping, you can just filter the rows that you need. Something like this:

df_deliv = df_deliv[df_deliv['batting_team']==df_deliv['winner']]
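The filter above still needs the winner column, which is the slow part in the question. One possible way to build it without a Python-level loop is to look the winner up by match id with Series.map. This is only a sketch: it uses small made-up frames in place of the CSV files, and it assumes the key column in matches.csv is named id, which the question does not confirm.

```python
import pandas as pd

# Tiny stand-in frames (made-up data); the real ones would come from pd.read_csv(...).
# Assumption: the matches file has an "id" column that matches "match_id".
df_match = pd.DataFrame({
    "id": [1, 2],
    "winner": ["Team A", "Team B"],
})
df_deliv = pd.DataFrame({
    "match_id": [1, 1, 2, 2],
    "batting_team": ["Team A", "Team B", "Team A", "Team B"],
    "batsman": ["p1", "p2", "p3", "p4"],
    "batsman_runs": [4, 1, 0, 6],
})

# Build an id -> winner lookup and apply it to every delivery in one vectorized step.
winner_by_id = df_match.set_index("id")["winner"]
df_deliv["winner"] = df_deliv["match_id"].map(winner_by_id)

# Keep only the deliveries where the batting team won the match.
df_won = df_deliv[df_deliv["batting_team"] == df_deliv["winner"]]
print(df_won)
```

Looking up by the id value rather than by row position (loc[i-1]) also avoids relying on the match ids being consecutive and starting at 1. A merge on match_id and id, as suggested in the comments, should give the same rows.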
Ronnie