Python generator parsing one file at a time

Question

I often have a folder with a bunch of CSV, Excel, or HTML files. I get tired of always writing a loop that iterates over the files in a folder and opens each one with the appropriate library, so I was hoping to build a generator that yields, one file at a time, the file already opened with the appropriate library. Here's what I had been hoping to do:

def __get_filename__(file):
    lst = str(file).split('\\')[-1].split('/')[-1].split('.')
    filename, filetype = lst[-2], lst[-1]
    return filename, filetype

def file_iterator(file_path, parser=None, sep=None, encoding='utf8'):
    import pathlib as pl
    if parser == 'BeautifulSoup':
        from bs4 import BeautifulSoup
    elif parser == 'pandas':
        import pandas as pd

    for file in pl.Path(file_path):
        if file.is_file():
            filename, filetype = __get_filename__(file)
            if filetype == 'csv' and parser == 'pandas':
                yield pd.read_csv(file, sep=sep)
            elif filetype == 'excel' and parser == 'pandas':
                yield pd.read_excel(file, engine='openpyxl')
            elif filetype == 'xml' and parser == 'BeautifulSoup':
                with open(file, encoding=encoding, errors='ignore') as xml:
                    yield BeautifulSoup(xml, 'lxml')
            elif parser == None:
                print(filename, filetype)
                yield file

But my hopes and dreams are crushed :P When I do this:

for file in file_iterator(r'C:\Users\hwx756\Desktop\tmp/'):
    print(file)

it throws the error TypeError: 'WindowsPath' object is not iterable

I am sure there must be a way to do this somehow, and I'm hoping that someone out there much smarter than me knows :) Thanks!

Nathan Roberts · Answer 1 · 2021-12-15T14:39:53.573

3

As the error says, 'WindowsPath' object is not iterable: your line for file in pl.Path('...'): causes the error because you are trying to iterate over the Path object itself. I haven't used the pathlib library before, but from looking at the docs, for file in pl.Path('...').iterdir(): should let you iterate through your directory the way you seem to be trying to.
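A minimal sketch of the iterdir() fix, in case it helps. The helper name iter_files is made up for illustration (it is not from the question), and the sorted() call is only there to make the order predictable:

```python
from pathlib import Path


def iter_files(folder):
    """Yield each regular file directly inside `folder` (hypothetical helper)."""
    # Path(folder) itself is not iterable; Path.iterdir() yields its entries.
    for entry in sorted(Path(folder).iterdir()):
        if entry.is_file():  # skip sub-directories
            yield entry
```

Calling for file in iter_files(some_folder): then gives one Path object per file, without descending into sub-folders.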

edited Dec 15 '21 at 14:39

answered Dec 15 '21 at 14:24


Thanks, but the question wasn't really about the error and how to solve it. I know that the pl.Path() object cannot be iterated over, but I want something like it that can be. I know that I can use pl.Path() in a straightforward loop (and I have in the past), but the point was to get a generator that, in a one-liner, gives me each file opened with whichever library I commonly use, without having to write more or less all the code in my function file_iterator() again and again. Instead I call the generator function and it's done for me. – jackewiebohne Dec 15 '21 at 14:56

score 1 · Accepted Answer · answered Dec 15 '21 at 14:30

So this is what I think you should do. Get the names of all the files in your folder like this:

from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(folder_path) if isfile(join(folder_path, f))]

Then make each path absolute and use that absolute path to read the file in pandas.
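Roughly, the listing step with absolute paths could look like the sketch below. The function name absolute_file_paths is an assumption made for this example, and the results are sorted only to keep the order stable:

```python
from os import listdir
from os.path import abspath, isfile, join


def absolute_file_paths(folder_path):
    """Return absolute paths of the regular files in `folder_path` (hypothetical helper)."""
    folder = abspath(folder_path)  # resolve a relative folder against the cwd
    return [
        join(folder, name)
        for name in sorted(listdir(folder))
        if isfile(join(folder, name))  # ignore sub-directories
    ]
```

Each returned string can then be handed to a reader such as pd.read_csv.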

Also, your code has a typo:

        yield pd.read_excel(path, engine='openpyxl')

There is no variable named path in that function; the loop variable is called file.
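For the one-liner generator the question is after, one possible shape is sketched here. This is not the asker's code: the loaders argument (a dict mapping a file suffix to a function that opens that file) is an assumption introduced for this example, and files with no matching loader are yielded as plain Path objects.

```python
from pathlib import Path


def file_iterator(folder, loaders=None):
    """Yield each file in `folder`, opened by the loader registered for its suffix.

    `loaders` is a hypothetical mapping such as {'.csv': some_reader};
    files without a matching loader are yielded as Path objects.
    """
    loaders = loaders or {}
    for entry in sorted(Path(folder).iterdir()):
        if not entry.is_file():
            continue  # skip sub-directories
        # entry.suffix includes the leading dot, e.g. '.csv' or '.xlsx'
        loader = loaders.get(entry.suffix.lower())
        yield loader(entry) if loader else entry
```

With pandas installed, something like file_iterator(folder, {'.csv': pd.read_csv, '.xlsx': pd.read_excel}) should yield one DataFrame per matching file. Note that the suffix of an Excel workbook is '.xlsx', not 'excel'.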
