0

What is correct way to filter out data from functions? Should I try to compress everything as much as possible (search_query) or should I filter through list everytime there is new argument that needs to be included (search_query2). More arguments I have, quicker I become more confused how to deal with this problem. Example:

import os

query = ""
my_path = os.getcwd()


def search_query(query, path, extensions_only=False, case_sensitive=False):
    results = []
    if extensions_only is True:
        for f in os.listdir(path):
            if case_sensitive:
                if f.endswith(query):
                    results.append(os.path.join(path, f))
            else:
                if f.endswith(query):
                    results.append(os.path.join(path, f).lower())

    elif case_sensitive is not True:
        for f in os.listdir(path):
            if query.lower() in f.lower():
                results.append(os.path.join(path, f))

    return results


results = search_query("_c", my_path)
print(results)


# Alternative way to deal with this
def search_query2(query, path, extensions_only=False, case_sensitive=False):
    results = []

    for f in os.listdir(path):
        results.append(os.path.join(path, f))

    if extensions_only:
        filtered_lst = []
        for part in results:
            if part.endswith(query):
                filtered_lst.append(part)
        results = filtered_lst

    if case_sensitive:
        filtered_lst = []
        for part in results:
            if query in part:
                filtered_lst.append(part)
        results = filtered_lst
    elif not case_sensitive:
        filtered_lst = []
        for part in results:
            if query.lower() in part.lower():
                filtered_lst.append(part)
        results = filtered_lst

    print(results)
    return results


search_query2("pyc", my_path, case_sensitive=True)
Hsin
  • 307
  • 5
  • 15
  • For example, `query` and `extensions_only` are redundant. You should only have one that can include wildcards (and maybe defaults to "\*"?) I guess that it could also invalidate `case_sensitive` if you would use *glob*. You could check https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory/48393588#48393588 for more (not directly related to this question) details. – CristiFati Apr 03 '18 at 10:20

4 Answers4

2

There isn't a fits-all "correct" way to do things like this. Another option is making separate functions, or private sub-functions called by this one as a wrapper.

In your specific case there are ways of optimising what you want to do in order to make it more clear.

You do a lot of

x = []
for i in y:
    if cond(i):
        x.append(i)
y = x

This is known as a filter and python has a couple of ways of doing this in one line

y = list(filter(cond, y))  # the 'functional' style

or

y = [i for i in y if cond(i)]  # comprehension

which make things a lot clearer. There are similar things for mappings where you write:

x = []
for i in y:
        x.append(func(i))
y = x

# instead do:
y = list(map(func, y))  # functional 
# or
y = [func(i) for i in y]  # comprehension

We can also combine maps and filters:

x = list(map(func, filter(cond, y)))
x = [func(i) for i in y if cond(i)]

using these we can build up many filters and maps in a row whilst remaining very clear about what we are doing. This is one of the advantages of functional programming.


I've modified your code to use generator expressions which will only evaluate right at the end when we call list(results) saving a lot of wasted time making new lists each time:

def search_query2(query, path, extensions_only=False, case_sensitive=False):

    results = (os.path.join(path, f) for f in os.listdir(path))

    if extensions_only:
        results = (part for part in results if part.endswith(query))
    elif case_sensitive:  # I'm pretty sure this is actually the logic you want
        results = (part for part in results if query in part)
    else:
        results = (part for part in results if query.lower() in part.lower())

    return list(results)
ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
FHTMitchell
  • 11,793
  • 2
  • 35
  • 47
  • 2
    And this approach works whether or not the effects of each argument are mutually exclusive. `results` just becomes a more and more nested genexpr as you go, with each layer performing one additional filter or mapping operation. It's far simpler than the OP's approach of constructing several temporary `list`s with arbitrarily different names, even though each one may as well replace the original for return purposes. – ShadowRanger Apr 03 '18 at 10:34
  • Thanks for answer. This looks great. I like how much cleaner code looks like. In my case though each option should be if without aditional elifs because I might want to use extensions trigger and case sensitive trigger (and few others) in one query. So my question is. Is generator better way to do this than making new list every time? From what I read (correct me if I'm wrong) making new lists is good practice because the biggest list used stays in memory until I create new one (or something like that) so iterating through it when I narrow my search is not recommended. – Hsin Apr 04 '18 at 11:47
  • 1
    @Hsin Generators are the best way to do it. The question isn't one of memory (although generators are more memory efficient anyway) but is actually one of speed. It takes time to create a list each time you run a map or filter operation because the interpreter needs to dynamically allocate memory. Hence using `list.append` or equivalent can cause significant slow downs when creating lists. This happens whenever the list needs becomes bigger than it's currently allocated memory. Obviously this all happens in the background, but my advice is use generators instead of lists wherever possible. – FHTMitchell Apr 04 '18 at 12:16
  • Thanks for explanation. I'll be grateful for links to more information. I really struggle with list comprehension and generators. It's hard for me to read them. – Hsin Apr 04 '18 at 16:30
  • 1
    @Hsin Basically like I put above: `x = [func(i) for i in y if cond(i)]` is the same as `x = []; for i in y: if cond(i): x.append(func(i))`. The hard bit is that the order is different, but you get used to it. **More links:** [List comprehensions](https://www.datacamp.com/community/tutorials/python-list-comprehension) & [Generators and comprehensions](https://www.programiz.com/python-programming/generator) – FHTMitchell Apr 04 '18 at 16:46
0

Do you want to filter all the same types of files? You can do this by using the glob module. for example

import glob

# Gets all the images in the specified directory.
print(glob.glob(r"E:/Picture/*/*.jpg"))

# Gets all the .py files from the parent directory.
print glob.glob(r'../*.py') 
  • I want to create quick search function in pyw tkinter window. I also want to add arguments to be more precise with my search query (that's why I have extension searcher and posiblity to turn off case sensitivity) – Hsin Apr 03 '18 at 10:31
0

I like to "prepare" my conditions at the start to get things nice and tidy and make it slighlty easier later on. Identify what effect the different arguments have on the code. In this instance, "case_sensitive" defines whethere you are using f.lower() or not, and "extensions" is defining your comparison method. In this instance I would write something like this.

def search_query(query, path, extensions_only=False, case_sensitive=False):
results = []

for f in os.listdir(path):
    if case_sensitive is True:
        fCase=f.lower()
        queryCase = query.lower()
    elif case sensitive is False:
        fCase = f
        queryCase = query

if extensions_only is True:
    if f.endswith(query):
                results.append(os.path.join(path, f))
elif extensions_only is False:
    if query in f:
            results.append(os.path.join(path, f))
return results

results = search_query("_c", my_path)
print(results)

This allows me to define the impact which each result has on the function at a different level, without having them nested and a bit of a headache to keep track of!

James Lingham
  • 419
  • 1
  • 3
  • 17
0

Another possibility: You could just use a single conditional list comprehension:

def search_query2(query, path, extensions_only=False, case_sensitive=False):
    files = [os.path.join(path, f) for f in os.listdir(path)]

    result = [part for part in files
              if (not extensions_only or part.endswith(query)) and
                 (query in part if case_sensitive
                  else query.lower() in part.lower())]

    print(results)
    return results

This may seem very "dense" and incomprehensible at first, but IMHO it makes it very clear (even clearer than your variable names) that all those conditions are merely filtering and never e.g. changing the actual elements of the result list.

Also, as noted in comments, you could just use a default-parameter extension="" instead of extensions_only (everything ends with "", and you could even pass a tuple of valid extensions). Either way, it is not entirely clear how the endswith and in constraints should play together, or whether the extension should also match case_sensitive or not. Further, files could probably be simplified as glob.glob(path + "/*").

But those points do not change the argument for using a single list comprehension for filtering the list of results.

tobias_k
  • 81,265
  • 12
  • 120
  • 179