
In my ML projects I've started encountering CSV files of 10 GB and more, so I am trying to implement an efficient way to grab specific lines from them.

This led me to discover itertools, which can supposedly skip a csv.reader's lines efficiently (whereas I assumed looping over the reader instead would load every row it passed into memory). Following this answer, I tried the following:

import collections
import csv
import itertools

with open(csv_name, newline='') as f:

    ## Efficiently find total number of lines in csv
    lines = sum(1 for line in f)

    ## Proceed only if my csv has more than just its header
    if lines < 2:
        return None   
    else:

        ## Read csv file
        reader = csv.reader(f, delimiter=',')

        ## Skip to last line
        consume(reader, lines)

        ## Output last row
        last_row = list(itertools.islice(reader, None, None))

with consume() being

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)
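I checked that consume() itself behaves as expected on a plain iterator, so the recipe doesn't seem to be the problem:

```python
import collections
import itertools

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 3)      # skips 0, 1, 2
print(next(it))     # -> 3
```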

However, I only get an empty list from last_row, meaning something went wrong.

The short csv which I am testing this code out on:

Author,Date,Text,Length,Favorites,Retweets
Random_account,2019-03-02 19:14:51,twenty-two,10,0,0

Where am I going wrong?

  • Would pandas not satisfy your need? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html – 20-roso Mar 02 '19 at 19:34
  • No, because it takes pandas around a few minutes to load the entire csv in for me, when I usually just need a specific line from them (in my example case above, the last line). – Coolio2654 Mar 02 '19 at 19:44
  • You need a newline offset index. So you can seek to a given line without reading the file. – Dan D. Mar 02 '19 at 20:55
  • Looping over it **does not load every row into memory**. It loads a single row at a time, so it requires at most the memory overhead of the largest row. – juanpa.arrivillaga Mar 02 '19 at 21:01

1 Answer


What's going wrong is that you are iterating over the file to get its length, which exhausts the file iterator:

lines = sum(1 for line in f)

You need to either re-open the file, or use f.seek(0).

So either:

def get_last_line(csv_name):

    with open(csv_name, newline='') as f:
        ## Efficiently find total number of lines in csv
        lines = sum(1 for line in f) # the iterator is now exhausted

    if lines < 2:
        return

    with open(csv_name, newline='') as f: # open file again
        # Keep going with your function
        ...

Alternatively,

def get_last_line(csv_name):

    with open(csv_name, newline='') as f:
        ## Efficiently find total number of lines in csv
        lines = sum(1 for line in f) # the iterator is now exhausted

        if lines < 2:
            return

        # we can "cheat" the iterator protocol
        # and move the iterator back to the beginning
        f.seek(0) 
        ... # continue with the function
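For completeness, here is one way the whole thing could be put together with the f.seek(0) fix (a sketch; get_last_row is my name for it, and note that it must skip lines - 1 rows, not lines, or the last row gets consumed as well):

```python
import collections
import csv
import itertools

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        collections.deque(iterator, maxlen=0)
    else:
        next(itertools.islice(iterator, n, n), None)

def get_last_row(csv_name):
    with open(csv_name, newline='') as f:
        lines = sum(1 for line in f)  # exhausts the file iterator
        if lines < 2:
            return None
        f.seek(0)                      # rewind so the reader starts fresh
        reader = csv.reader(f, delimiter=',')
        consume(reader, lines - 1)     # skip everything but the last row
        return next(reader, None)      # the one remaining row, parsed
```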

However, if you want the last line, you can simply do:

for line in f:
    pass
print(line)

Perhaps using a collections.deque would be faster (it is what the consume recipe uses):

collections.deque(f, maxlen=1)
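For instance, with io.StringIO standing in for a file:

```python
import collections
import io

f = io.StringIO("a,b\n1,2\n3,4\n")
last = collections.deque(f, maxlen=1)  # keeps only the final line
print(last[0])  # -> '3,4\n'
```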

Here are two different ways to approach the problem; let me just create a file real quick:

Juans-MacBook-Pro:tempdata juan$ history > history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ cat history.txt | wc -l
    2000

OK, in an IPython repl:

In [1]: def get_last_line_fl(filename):
   ...:     with open(filename) as f:
   ...:         prev = None
   ...:         for line in f:
   ...:             prev = line
   ...:         if prev is None:
   ...:             return None
   ...:         else:
   ...:             return prev
   ...:

In [2]: import collections
   ...: def get_last_line_dq(filename):
   ...:     with open(filename) as f:
   ...:         last_two = collections.deque(f, maxlen=2)
   ...:         if len(last_two) < 2:
   ...:             return
   ...:         else:
   ...:             return last_two[-1]
   ...:

In [3]: %timeit get_last_line_fl('history.txt')
1000 loops, best of 3: 337 µs per loop

In [4]: %timeit get_last_line_dq('history.txt')
1000 loops, best of 3: 339 µs per loop

In [5]: get_last_line_fl('history.txt')
Out[5]: '  588  history >> history.txt\n'

In [6]: get_last_line_dq('history.txt')
Out[6]: '  588  history >> history.txt\n'
juanpa.arrivillaga
  • So, how would I re-open the file, or use `f.seek(0)` somehow (not sure how it is used)? I find being able to find the length of my csv important, and would rather not stop doing that. Also, where would I use the `collections.deque(f, maxlen=1)`? – Coolio2654 Mar 02 '19 at 23:45
  • While I appreciate your work in the answer, it seems to me to give only the last line of the opened csv, instead of an arbitrary one. How could I use your two functions to return (ideally parsed by `csv.reader`) the 22nd row, for ex.? – Coolio2654 Mar 03 '19 at 00:51
  • @Coolio2654 use something like `line = next(itertools.islice(f, n, n+1), None)` for an arbitrary line, `n`. Take care of handling the header. I suggest you don't parse it using `csv` because parsing will require a lot of work that you are just going to throw away, if performance is your issue here. Just parse the last line, you can do `parsed = next(csv.reader(io.StringIO(line)))` to take advantage of the module – juanpa.arrivillaga Mar 03 '19 at 01:30
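Sketching out that last comment's suggestion (get_nth_row is a hypothetical name; n counts physical lines from 0, with the header at line 0):

```python
import csv
import io
import itertools

def get_nth_row(csv_name, n):
    """Return line n of the file parsed as a csv row, or None if it
    doesn't exist. Earlier lines are skipped, never parsed."""
    with open(csv_name, newline='') as f:
        # islice skips the first n lines at C speed
        line = next(itertools.islice(f, n, n + 1), None)
    if line is None:
        return None
    # parse only this single line with the csv module
    return next(csv.reader(io.StringIO(line)))
```

Note this equates physical lines with csv rows, so it breaks if quoted fields contain embedded newlines.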