
I have a large data set and I need to randomly sample a smaller data set from it. The first column contains vehicle IDs, and the sampling should be done over these vehicle IDs. Each vehicle has more than one record, so there are multiple rows per vehicle. I have put the code I am using below, but it takes a long time to run. Is there a faster way to do this?

Example:

df:

vehicle_ID   SectionID      time
     1         200       00:00:03
    100        237       00:00:03
     1        1872       00:00:06

Code

import random
import pandas as pd

# Sample 12,900 vehicle IDs, then collect their rows one vehicle at a time
veh = df['vehicle_ID'].unique()
sample = random.sample(list(veh), 12900)
ndf = pd.DataFrame()
for i in sample:
    new = df[df['vehicle_ID'] == i]
    ndf = ndf.append(new, ignore_index=True)

1 Answer


Try this:

import numpy as np

# Select a random vehicle_ID and keep only that vehicle's rows
data1 = df[df.vehicle_ID == np.random.choice(df['vehicle_ID'].unique())].reset_index(drop=True)

# Get a random starting index from data1, leaving room for three more rows
start_ix = np.random.choice(data1.index[:-3])

# Print a sequence of four rows of that vehicle, starting at the random index
print(data1.loc[start_ix:start_ix + 3])
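
If several vehicles are needed, the same idea can be repeated per vehicle and the windows concatenated. This is only a sketch: the helper name sample_windows and the n_vehicles parameter are assumptions, while the four-row window matches the slice above.

import numpy as np
import pandas as pd

def sample_windows(df, n_vehicles=3, window=4):
    # Pick n_vehicles distinct vehicle IDs at random
    ids = np.random.choice(df['vehicle_ID'].unique(), size=n_vehicles, replace=False)
    chunks = []
    for vid in ids:
        data1 = df[df.vehicle_ID == vid].reset_index(drop=True)
        if len(data1) < window:
            # Too few rows for a full window: keep whatever is there
            chunks.append(data1)
            continue
        start_ix = np.random.choice(data1.index[:-(window - 1)])
        chunks.append(data1.loc[start_ix:start_ix + window - 1])
    return pd.concat(chunks, ignore_index=True)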
  • Thank you for replying. I am not sure that works for the purpose I am looking for; my final data frame must consist of all the records of all the randomly selected vehicles. – Ellie Oct 27 '19 at 22:07
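
A minimal sketch that keeps all records of all the randomly selected vehicles, assuming df is the full frame from the question and 12,900 IDs are wanted: sample the IDs once, then filter with a single isin mask instead of appending inside a loop.

import numpy as np
import pandas as pd

# Sample 12,900 vehicle IDs without replacement (count taken from the question)
veh = df['vehicle_ID'].unique()
sample_ids = np.random.choice(veh, size=12900, replace=False)

# One boolean mask keeps every record of every sampled vehicle
ndf = df[df['vehicle_ID'].isin(sample_ids)].reset_index(drop=True)

A single isin filter avoids rebuilding ndf on every append, which is where most of the time in the original loop goes.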