Sorry for that train wreck of a title...not sure how else to word it.

I'm ingesting files from a certain directory one category at a time. The category is part of the filename following a very specific format, but there are a few issues throwing my process off.

Example filename:

.../Bike.txt

If there's an overabundance of source data for a particular category, the system will create numbered files to handle the overflow. In that case, the files may look like this:

.../Bike_1.txt

.../Bike_2.txt

I need to grab the files for a particular category regardless of whether it's "Bike.txt" or "Bike_1.txt". I figured I could use a wildcard to find files matching "Bike*.txt". The problem with this is that I may also have a file called something like "Bike_Helmet.txt", and I do not want to ingest that file if I'm currently looking at the bike category.

This is being done using PySpark in Databricks. I've used the glob library up until now to handle this, but I'm not sure it can do what I need here.

To summarize, after specifying a category, I want to find files that match the following formats:

.../[category].txt

.../[category]_[a number].txt

But I do not want to retrieve files that are of the format .../[category]_[non-numeric string].txt.

Is there a way to do this in a single pass, or will I have to ingest based on .../[category].txt first and then .../[category]_[0-9]*.txt a second time?

Eric J

2 Answers

You could use pathlib (or the older glob, or simply os.listdir()) to search all files starting with "Bike" and then use a regular expression to ignore the invalid results.

import pathlib
import re

def get_files(category):
    # Use a raw string so \d and \. aren't treated as (invalid) string escapes
    prog = re.compile(category + r'(_\d+)?\.txt')
    return [file for file in pathlib.Path('..').glob(category + '*.txt')
            if prog.match(file.name)]


bike_files = get_files('Bike')
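A slightly more defensive variant (a sketch along the same lines, with a hypothetical `get_files_safe` helper) escapes the category with `re.escape()`, in case a category name ever contains regex metacharacters, and uses `fullmatch()` so an odd name like "Bike.txt.txt" can't pass on a prefix match:

```python
import pathlib
import re

def get_files_safe(category, directory='..'):
    # re.escape() guards against regex metacharacters in the category name;
    # fullmatch() requires the entire filename to match, so trailing junk
    # after ".txt" (e.g. "Bike.txt.txt") is rejected.
    prog = re.compile(re.escape(category) + r'(_\d+)?\.txt')
    return [f for f in pathlib.Path(directory).glob(category + '*.txt')
            if prog.fullmatch(f.name)]
```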
wovano

I think you can use plain Python within PySpark to deal with this.

Let's assume you can get a list of all the files in the target directory via glob. (I'm uncertain whether that's the case, or whether you need to scan the files and conditionally ingest at the same time, but for the sake of this answer I'm making that assumption.)

Let's say this yields the following list:

file_list = [
    'Bike.txt',
    'Bike_1.txt',
    'Bike_2.txt',
    'Bike_49341.txt',
    'Bike_helmet.txt',
    'Bike_wheelie.txt',
    'Helmet.txt',
    'Helmet_1.txt',
]

This SO answer offers a good way to determine whether a string is a number:

def is_number(n):
    try:
        float(n)   # Type-casting the string to `float`.
                   # If string is not a valid `float`, 
                   # it'll raise `ValueError` exception
    except ValueError:
        return False
    return True
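One caveat worth noting (my addition, not covered by the linked answer): because this check is based on `float()`, it accepts scientific notation, signs, infinities, and NaN, so it is looser than "a run of digits". If the overflow suffix is always an unsigned integer, `str.isdigit()` is a stricter alternative:

```python
def is_number(n):
    try:
        float(n)
    except ValueError:
        return False
    return True

# float() accepts more than plain digit runs:
print(is_number("1e3"))    # True (scientific notation)
print(is_number("nan"))    # True (float("nan") is a valid float)

# str.isdigit() only accepts unsigned digit runs, which better matches
# numbered overflow files like Bike_49341.txt:
print("49341".isdigit())   # True
print("1e3".isdigit())     # False
```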

Now you have a list of filenames and a function to determine if a string is a number. Using this, we can get a list of valid file names.

from pathlib import PurePath

target_category = "bike"
valid_files = []
for file_name in file_list:
    file_stem = PurePath(file_name).stem
    file_split = file_stem.split("_")
    if file_split[0].lower() == target_category:
        # keep "Bike.txt" itself, or "Bike_<number>.txt"
        if len(file_split) == 1 or is_number(file_split[1]):
            valid_files.append(file_name)

which yields:

>>> valid_files
['Bike.txt', 'Bike_1.txt', 'Bike_2.txt', 'Bike_49341.txt']

You can now go back and ingest only the files in valid_files.

EDIT: changed the answer so it checks to make sure the category is correct, first.

NOTE: PurePath(filename).stem will only behave as expected if the files have a single suffix (e.g. .txt) rather than multiple suffixes (e.g. .tar.gz).

the-ucalegon