
I am reading a large file that contains ~9.5 million rows x 16 cols.

I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.

I am able to load the data, and then select every 500th row.
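For reference, here is a minimal sketch of what I currently do (the file name is just a placeholder):

import pandas as pd

# Read everything, then keep every 500th row
df = pd.read_csv('trips.csv')   # placeholder file name
sample = df.iloc[::500]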

My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?

Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.

Here is a snippet of what the data looks like (first five rows). The first four rows are out of order, but the remaining dataset looks ordered (by time):

VendorID    tpep_pickup_datetime    tpep_dropoff_datetime   passenger_count trip_distance   RatecodeID  store_and_fwd_flag  PULocationID    DOLocationID    payment_type    fare_amount extra   mta_tax tip_amount  tolls_amount    improvement_surcharge   total_amount
0   1   2017-01-09 11:13:28 2017-01-09 11:25:45 1   3.30    1   N   263 161 1   12.5    0.0 0.5 2.00    0.00    0.3 15.30
1   1   2017-01-09 11:32:27 2017-01-09 11:36:01 1   0.90    1   N   186 234 1   5.0 0.0 0.5 1.45    0.00    0.3 7.25
2   1   2017-01-09 11:38:20 2017-01-09 11:42:05 1   1.10    1   N   164 161 1   5.5 0.0 0.5 1.00    0.00    0.3 7.30
3   1   2017-01-09 11:52:13 2017-01-09 11:57:36 1   1.10    1   N   236 75  1   6.0 0.0 0.5 1.70    0.00    0.3 8.50
4   2   2017-01-01 00:00:00 2017-01-01 00:00:00 1   0.02    2   N   249 234 2   52.0    0.0 0.5 0.00    0.00    0.3 52.80
Omar Hijazi

4 Answers


Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?

Something you could do is use the skiprows parameter of read_csv, which accepts a list-like argument of row indices to discard (and thus, indirectly, to select). You could build an np.arange as long as the number of rows to read and remove every 500th element from it with np.delete, so that only every 500th row is read:

import numpy as np
import pandas as pd

n_rows = 9_500_000                                  # must be an integer, not 9.5e6
skip = np.arange(n_rows)                            # all line numbers in the file
skip = np.delete(skip, np.arange(0, n_rows, 500))   # drop every 500th line from the skip list
df = pd.read_csv('my_file.csv', skiprows=skip)      # line 0 (the header) is kept as well
yatu
  • quite useful. To reduce the amount of records by less than one half, the deletion step could occur more than once – mik Sep 03 '19 at 21:13
  • np.delete() generated an error when I tried to execute the code: IndexError: arrays used as indices must be of integer (or boolean) type – stevehs17 Nov 24 '21 at 00:49

Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?

First get the length of the file with a small helper function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:

# https://stackoverflow.com/q/845058
import numpy as np
import pandas as pd

def file_len(fname):
    # Count the lines in the file (header included)
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print(len_of_file)

# Skip every line except multiples of 500 (line 0, the header, is kept)
skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
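
As a rough, optional sanity check (purely illustrative), the size of the sampled frame can be compared against the expected count:

# Every 500th file line is kept plus the header, so roughly len_of_file / 500 rows
print(len(df), len_of_file // 500)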

How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.

The idea is to read only the datetime column via the usecols parameter, sort it, select every 500th index value, take the set difference, and pass that to skiprows again:

import numpy as np
import pandas as pd

def file_len(fname):
    # Count the lines in the file (header included)
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

# Read only the datetime column
df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

# Sort by time and keep the original index of every 500th value
sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
                 .iloc[np.arange(0, len(df1), 500)].index)

# DataFrame indices are 0-based data rows while file lines are shifted by the header,
# so keep line 0 (the header) plus sorted_idx + 1 and skip everything else
keep = np.concatenate(([0], sorted_idx.to_numpy() + 1))
skipped = np.setdiff1d(np.arange(len_of_file), keep)
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
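
As an optional follow-up (not part of the original answer), one can verify that the sampled frame really comes out in time order:

# Convert to datetime and check that the sampled rows are monotonically increasing
check = pd.to_datetime(df['tpep_pickup_datetime'])
print(check.is_monotonic_increasing)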
jezrael

Use a lambda with skiprows:

pd.read_csv(path, skiprows=lambda i: i % N)

to skip every row whose line number is not a multiple of N, i.e. to read only every Nth row (the header at line 0 is kept as well).

source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
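
A slightly fuller sketch of the same idea (path and N are placeholders); the callable receives the file line number, and line 0, the header, is kept because 0 % N == 0:

import pandas as pd

N = 500                    # keep every Nth line (placeholder value)
path = 'my_file.csv'       # placeholder file name

# skiprows: a truthy return value means "skip this line"
df = pd.read_csv(path, skiprows=lambda i: i % N)
print(len(df))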

william_grisaitis

You can use the csv module, which returns an iterator, together with itertools.cycle to select every nth row.

import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

# Yields True for every 500th element, False otherwise
chooser = (x == 0 for x in cycle(range(cycle_size)))
with open(source_file) as f1:
    rdr = csv.reader(f1)
    # Keep only the rows paired with a True from chooser
    data = [row for pick, row in zip(chooser, rdr) if pick]
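
If a DataFrame is wanted afterwards, one possible follow-up (the first picked row is the header line, since the cycle starts at 0, and every value comes back as a string):

import pandas as pd

# data[0] holds the header line; the remaining rows are the sampled records
df = pd.DataFrame(data[1:], columns=data[0])
print(df.head())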
Joey Gao