I have a very large dataframe (about 1.1M rows) and I am trying to sample it.
I have a list of indexes (about 70,000 indexes) that I want to select from the entire dataframe.
This is what Ive tried so far but all these methods are taking way too much time:
Method 1 - Using pandas :
sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]
Method 2 :
I tried to write all the sampled lines to another csv.
f = open("data.csv",'r')
out = open("sampled_date.csv", 'w')
out.write(f.readline())
while 1:
total += 1
line = f.readline().strip()
if line =='':
break
arr = line.split(",")
if (int(arr[0]) in sample_index_array):
out.write(",".join(e for e in (line)))
Can anyone please suggest a better method? Or how I can modify this to make it faster?
Thanks