In my ML projects I've started encountering 10 GB+ CSV files, so I'm trying to implement an efficient way to grab specific lines from them. This led me to discover itertools (which can supposedly skip over a csv.reader's lines efficiently, whereas looping over the reader would load every row it passes into memory), and following this answer I tried the following:
```python
import collections
import csv
import itertools

with open(csv_name, newline='') as f:
    ## Efficiently find the total number of lines in the csv
    lines = sum(1 for line in f)
    ## Proceed only if my csv has more than just its header
    if lines < 2:
        return None
    else:
        ## Read csv file
        reader = csv.reader(f, delimiter=',')
        ## Skip to the last line
        consume(reader, lines)
        ## Output the last row
        last_row = list(itertools.islice(reader, None, None))
```
with consume() being
```python
def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)
```
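As a sanity check (to take the csv machinery out of the picture), consume() does behave as advertised on a plain iterator:

```python
import collections
import itertools

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 3)                 # skip the first three items
print(next(it))                # -> 3

it = iter(range(10))
consume(it, None)              # drain the iterator completely
print(next(it, 'exhausted'))   # -> exhausted
```

so the recipe itself seems fine.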
However, I only get an empty list from last_row, meaning something went wrong.
The short csv I am testing this code on:

```
Author,Date,Text,Length,Favorites,Retweets
Random_account,2019-03-02 19:14:51,twenty-two,10,0,0
```
Where am I going wrong?