26

I know this might be an old debate, but between pandas' drop method and Python's del statement, which performs better on a large dataset?

I am learning machine learning with Python 3 and am not sure which one to use. My data is in a pandas DataFrame, whereas del is a built-in Python statement.

Ralf
sagar jain
  • I would suggest using drop, since it can easily drop multiple columns at once: df.drop(['A','B'], axis=1) – BENY Nov 22 '17 at 03:15
  • Check this out: https://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe-using-python-del – Greg Nov 22 '17 at 03:22
  • @Wen dropping multiple columns wasn't my concern. For a larger dataset, if I only need to delete one column, will drop perform better than del, or vice versa? – sagar jain Nov 22 '17 at 03:35
  • @Greg this is what I was searching for. Thanks a lot. I guess del will free some memory from the data frame, while drop will just return a data frame that hides the dropped column. Is that right, or am I missing something? – sagar jain Nov 22 '17 at 03:39
  • @sagarjain you can make the `.drop` method work in-place by passing `df.drop(, inplace=True)`. I don't think there would be a performance difference. Can't you run a test if you are curious? – juanpa.arrivillaga Nov 22 '17 at 08:39
  • @juanpa.arrivillaga I tried it on some Kaggle datasets but didn't find much difference, so I was asking. Thanks, by the way. – sagar jain Nov 22 '17 at 10:51

4 Answers

20

Summarizing a few points about functionality:

  • drop operates on both columns and rows; del operates on columns only.
  • drop can operate on multiple items at a time; del operates only on one at a time.
  • drop can operate in-place or return a copy; del is an in-place operation only.

The documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html has more details on drop's features.
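As a rough illustration of the three points above, here is a minimal sketch on a toy frame. The column names A, B and C are made up for the example and are not from the question.

```python
import pandas as pd

# Hypothetical toy frame; the column names A, B, C are illustrative only.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# drop works on rows or columns and accepts several labels at once.
without_rows = df.drop(index=[0, 2])        # new frame without rows 0 and 2
without_cols = df.drop(columns=["A", "B"])  # new frame without columns A and B

# drop returns a new DataFrame by default, so df still has all three columns here.
original_columns = list(df.columns)

# del removes exactly one column per statement, always in place.
del df["C"]
remaining_columns = list(df.columns)
```

Passing inplace=True to drop would modify df directly instead of returning a new frame.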

flow2k
10

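Using about 1.6 GB of randomly generated data, df.drop appears to be faster than del, especially when removing several labels at once:

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20000,10000))
t_1 = time.time()
# Caution: with no axis=1 or columns= argument, drop removes the rows labelled 2, 4 and 1000
df.drop(labels=[2,4,1000], inplace=True)
t_2 = time.time()
print(t_2 - t_1)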

0.9118959903717041

Compared to:

df = pd.DataFrame(np.random.rand(20000,10000))
t_3 = time.time()
del df[2]
del df[4]
del df[1000]
t_4 = time.time()
print(t_4 - t_3)

4.052732944488525

@Inder's comparison is not quite the same since it doesn't use inplace=True.
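Since drop defaults to axis=0 (rows) while del always removes a column, a like-for-like test needs columns= (or axis=1) on the drop side. Below is a minimal sketch of such a comparison. The helper name time_column_removal, the smaller frame size and the chosen labels are my own assumptions, picked so it runs quickly; treat the printed numbers as indicative only, since they will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def time_column_removal(n_rows=2000, n_cols=500, labels=(2, 4, 100)):
    """Time removing the same columns with drop and with del on identical data."""
    data = np.random.rand(n_rows, n_cols)

    # drop: remove all requested columns in one in-place call.
    df_drop = pd.DataFrame(data.copy())
    start = time.perf_counter()
    df_drop.drop(columns=list(labels), inplace=True)
    drop_seconds = time.perf_counter() - start

    # del: remove the same columns one statement at a time.
    df_del = pd.DataFrame(data.copy())
    start = time.perf_counter()
    for label in labels:
        del df_del[label]
    del_seconds = time.perf_counter() - start

    return drop_seconds, del_seconds, df_drop, df_del


drop_seconds, del_seconds, df_drop, df_del = time_column_removal()
print(f"drop: {drop_seconds:.6f} s, del: {del_seconds:.6f} s")
```

Each approach gets its own fresh frame built from the same array, so neither run benefits from work done by the other. For more stable numbers, repeat the measurement several times, for example with the timeit module.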

KT12
7

I tested this on about 10 MB of stock data and got the following results.

For drop, with the following code:

import time

t = time.time()
d.drop(labels="2")  # not in-place: returns a new DataFrame and leaves d unchanged
print(time.time() - t)

0.003617525100708008

For del, with the following code on the same column:

t = time.time()
del d[2]  # in-place: removes column 2 from d
print(time.time() - t)

The time I got was:

0.0045168399810791016

Reruns on different datasets and columns didn't make any significant difference.

Inder
1

With drop and inplace=False, you can create a subset DataFrame while leaving the original DataFrame untouched. With del, I believe this option is not available.

Jagdish