
I have a dataframe with over 200k records.

I wish to slim my dataframe down by half by dropping one record for each one that I keep (as shown below).

Keep Row 1,
Drop Row 2,
Keep Row 3,
Drop Row 4,
Keep Row 5,
and so on and so forth...

If this is not possible, then I am more than willing to use pandas' sample functionality in conjunction with a mask.


2 Answers


You can use slicing with the step syntax [start:stop:step]: [1::2] starts at position 1 and steps by 2, so it selects positions 1, 3, 5, .... Since you have a dataframe, you can use df.index[1::2]. That keeps row 1, drops row 2, keeps row 3, drops row 4, and so on.

(Positions start from zero, so if you want to start keeping from position zero instead, use [::2].)

import pandas as pd
import numpy as np

# random dataframe to demonstrate on
df = pd.DataFrame({
    'A': np.random.randint(0, 10, 1000),
    'B': np.random.randint(0, 10, 1000)
})
print(df)

# keep every second row, starting from position 1
# (with the default RangeIndex this is equivalent to df.iloc[1::2])
df = df.iloc[df.index[1::2]]
print(df)

# input random df
     A  B
0    1  8
1    5  4
2    8  4
3    9  0
4    9  5
..  .. ..
995  8  9
996  4  9
997  8  4
998  2  8
999  9  0

[1000 rows x 2 columns]

# result random df
     A  B
1    5  4
3    9  0
5    6  8
7    4  1
9    6  6
..  .. ..
991  1  5
993  6  8
995  8  9
997  8  4
999  9  0

[500 rows x 2 columns]

You can index with a boolean mask like mask = (1 - np.arange(len(df)) % 2).astype(bool) and then select with df[mask] (a full sketch follows the list below).

  • You can remove the 1 - if you're OK with starting the drops at the first record instead of the second.
  • If you have a numerical index, you can replace np.arange(len(df)) with df.index.
  • You can replace ...astype(bool) with ... == 0, ... == 1, or even np.logical_not(...).
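
Here's a minimal sketch of that mask approach end to end (the random dataframe and the column names A and B are placeholders mirroring the example in the other answer):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': np.random.randint(0, 10, 1000),
    'B': np.random.randint(0, 10, 1000)
})

# True at positions 0, 2, 4, ... -> keep row 1, drop row 2, keep row 3, ...
mask = (1 - np.arange(len(df)) % 2).astype(bool)
df = df[mask]
print(df)  # 500 rows x 2 columns

Note the difference from df.sample(frac=0.5), which would give you a random half; the mask keeps the deterministic every-other-row pattern the question asks for.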