
I have the two following functions to extract data from a csv file, one returns a list and the other a generator:

List:

def data_extraction(filename,start_line,node_num,span_start,span_end):
    with open(filename, "r") as myfile:
        file_= csv.reader(myfile, delimiter=',')  #extracts data from .txt as lines
        return [filter(lambda a: a != '', row[span_start:span_end]) \
        for row in itertools.islice(file_, start_line, node_num+1)]

Generator:

def data_extraction(filename,start_line,node_num,span_start,span_end):
    with open(filename, "r") as myfile:
        file_= csv.reader(myfile, delimiter=',')  #extracts data from .txt as lines            
        return (itertools.ifilter(lambda a: a != '', row[span_start:span_end]) \
                for row in itertools.islice(file_, start_line, node_num+1))

I start my program with a call to one of the above functions to extract the data. The next line is: `print [x for x in data]`

When I use the function which returns a list it all works fine; when I use the generator I get: `ValueError: I/O operation on closed file`.

I gathered from other questions that this is probably because the file opened by the `with` statement is closed once my `data_extraction` function returns.

The question is: is there a workaround that lets me keep an independent function to extract the data, so that I don't have to put all my code inside one function? And secondly, will I be able to reset the generator to use it multiple times?

The reason for wanting to keep the generator over the list is that I am dealing with large datasets.

Sorade
  • http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – Rolf of Saxony Sep 23 '16 at 08:50
  • The 2nd function is not a generator - a generator should `yield` rather than `return` something. – ivan_pozdeev Sep 23 '16 at 08:50
  • @ivan_pozdeev The second function *returns* a generator. – MisterMiyagi Sep 23 '16 at 08:57
  • ...and exits the `with` block upon doing so! – ivan_pozdeev Sep 23 '16 at 08:59
  • @ivan_pozdeev That doesn't stop it from returning a generator. Also, yielding from inside the with statement slightly defeats the purpose of the with statement, because then you have to worry about closing the generator as well! – MisterMiyagi Sep 23 '16 at 09:03
  • BTW: your backslash for "line continuation" is **useless**. Parentheses (or brackets of any kind) already imply line continuation and you should **never** use backslashes to do line continuation in Python. So just do `[ blah blah blah blah]` or `(blah blah blah blah)` (see the sketch just after these comments). – Bakuriu Sep 23 '16 at 10:05
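
A quick, hedged illustration of Bakuriu's point (the variable names are made up for the example): any open bracket already allows the expression to continue on the following lines.

squares = [n * n
           for n in range(10)
           if n % 2 == 0]   # no backslash needed inside the brackets
print(squares)  # prints [0, 4, 16, 36, 64]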

1 Answer


Note that the with statement closes the file at its end. That means no more data can be read from it.

The list version actually reads in all data, since the list elements must be created.

The generator version, on the other hand, does not read any data until you actually fetch items from the generator. Since you only do that after the file has been closed, the generator fails the moment it tries to read.

You can only avoid this by actually reading the data while the file is open, e.g. as you did by creating the list. Trying not to hold all the data in memory (the generator) while still expecting all the data to be available after the file is closed doesn't make sense.
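
To make the failure concrete, here is a minimal, self-contained reproduction; the file name is just a placeholder:

import csv

def rows(filename):
    with open(filename, "r") as f:
        return (row for row in csv.reader(f))  # the file closes as soon as we return

gen = rows("data.csv")  # nothing has been read from the file yet
next(gen)               # ValueError: I/O operation on closed file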

The alternative is to open the file each time you read - the file object itself acts like a generator over its lines. If you want to avoid duplicating the filtering code, you can create a wrapper for this:

The straightforward way is to turn your generator-returning function into a generator function itself:

def data_extraction(filename, start_line, node_num, span_start, span_end):
    with open(filename, "r") as myfile:
        file_ = csv.reader(myfile, delimiter=',')  # reads the file as rows
        for row in itertools.islice(file_, start_line, node_num + 1):
            yield itertools.ifilter(lambda a: a != '', row[span_start:span_end])

This has a bit of a problem: the with statement only closes the file once the generator is exhausted or garbage collected. That puts you back in a situation similar to holding an open file handle, which you must remember to finish with as well.
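
One hedged sketch of making that clean-up explicit with the generator function above (the argument values are placeholders, borrowed from the example further down): contextlib.closing calls the generator's close(), which in turn lets the with block inside data_extraction release the file.

from contextlib import closing

with closing(data_extraction('/my/file/location.csv', 3, 200, 5, 10)) as rows:
    for row in rows:
        pass  # process each filtered row here
# at this point the generator is closed and so is the file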

A safer alternative is to have a filter generator and feed it the file content:

def data_extraction(file_iter, start_line, node_num, span_start, span_end):
    file_ = csv.reader(file_iter, delimiter=',')  # reads the already-opened file as rows
    for row in itertools.islice(file_, start_line, node_num + 1):
        yield itertools.ifilter(lambda a: a != '', row[span_start:span_end])

# use it as such:
with open(filename, "r") as myfile:
    for row in data_extraction(myfile, start_line, node_num, span_start, span_end):
        pass  # do stuff with each filtered row

If you need this often, you can also create your own class by implementing the context manager protocol. This can then be used in a with statement instead of open.

class FileTrimmer(object):
    def __init__(self, filename, start_line, node_num, span_start, span_end):
        # store all attributes on self
        self.filename = filename
        self.start_line = start_line
        self.node_num = node_num
        self.span_start = span_start
        self.span_end = span_end

    def __enter__(self):
        self._file = open(self.filename, "r")
        csv_reader = csv.reader(self._file, delimiter=',')  # reads the file as rows
        return (
            itertools.ifilter(
                lambda a: a != '',
                row[self.span_start:self.span_end])
            for row in itertools.islice(
                csv_reader,
                self.start_line,
                self.node_num + 1
            ))

    def __exit__(self, *args, **kwargs):
        self._file.close()

You can now use it like this:

with FileTrimmer('/my/file/location.csv', 3, 200, 5, 10) as csv_rows:
    for row in csv_rows:  # row is an *iterator* over the row
        print('<', '>, <'.join(row), '>')
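
Since __enter__ reopens the file on every entry, the same FileTrimmer instance can be used again (sequentially, not nested), which also covers the "reset the generator" part of the question. A sketch of that usage:

trimmer = FileTrimmer('/my/file/location.csv', 3, 200, 5, 10)

with trimmer as csv_rows:                 # first pass over the file
    row_count = sum(1 for row in csv_rows)

with trimmer as csv_rows:                 # second, independent pass: the file is reopened
    for row in csv_rows:
        print('<', '>, <'.join(row), '>')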
MisterMiyagi
  • Thanks for the answer! Could you add an example of how the `FileTrimmer` class would be used? Say I want to print the file content row by row. – Sorade Sep 23 '16 at 10:28
  • @Sorade I've added a short example. Let me know if there are any typos, I don't have a CSV file around to test it on. – MisterMiyagi Sep 23 '16 at 10:31
  • I think that in the FileTrimmer method __enter__ self._file should be open(self.filename, "r") rather than open (filename, "r"). When I ran the example it printed out many lines of: – Sorade Sep 23 '16 at 10:40
  • @Sorade Sorry, I didn't actually inspect your iterator chaining. It's fixed now, the row iterator is resolved using `str.join`. I really suggest conforming to PEP8 when asking questions, the formatting put me off from validating whether that sub-problem was done properly. Have a look at the changed format, it's way more readable what actually happens. – MisterMiyagi Sep 23 '16 at 10:47