6

I'm looking for alternatives to using comprehensions for nested data structures, or for ways to get comfortable with nested list comprehensions if that is possible.

Without comprehensions, generating a list of items using a nested loop works like this:

combos = []
for a in iterable:
    for b in valid_posibilities(a):
        combos.append((a,b))

Turning this into a comprehension retains the order of the loops, which makes it read nicely when split across multiple lines:

combos = [
    (a,b)
    for a in iterable
        for b in valid_posibilities(a)
    ]
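
As a quick check that the two forms agree, here is a minimal sketch. The values in `iterable` and the body of `valid_posibilities` are made up for illustration; only the names come from the snippets above.

```python
# Hypothetical stand-ins: the real iterable and valid_posibilities are not shown above.
iterable = [1, 2, 3]

def valid_posibilities(a):
    """Assumed for illustration: every non-negative integer below a."""
    return range(a)

# Explicit nested loop.
combos_loop = []
for a in iterable:
    for b in valid_posibilities(a):
        combos_loop.append((a, b))

# Flat comprehension: the for clauses appear in the same order as the loops.
combos_comp = [
    (a, b)
    for a in iterable
        for b in valid_posibilities(a)
]

print(combos_loop == combos_comp)  # True
print(combos_comp)  # [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
```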

However, this creates a single flat list. If I want code that produces a nested data structure, I would use something like this:

# same as above but instead of list of (a,b) tuples,
# I want a dictionary of {a:[b]} structure
combos_map = {}
for a in iterable:
    options = []
    for b in valid_posibilities(a):
        options.append(b)
    combos_map[a] = options

(The following snippet shows the equivalent code using plain lists, for readers who haven't seen a dictionary comprehension before; seeing one for the first time while it is nested in an unusual way is hard to follow.)

# for people unfamiliar with dictionary comprehensions
# this is the equivalent nesting structure
combos = []
for a in iterable:
    options = []
    for b in valid_posibilities(a):
        options.append(b)
    combos.append(options)

######## or equivalently
combos = [
      [b
        for b in valid_posibilities(a)
      ]
    for a in iterable
    ]

Now, converting the dictionary version to a comprehension, we get this:

combos_map = {
    a:[b
        for b in valid_posibilities(a)
      ]
    for a in iterable
    }

What the heck? The order of the loops switched! This is because the inner loop has to go inside the inner list. If the order were simply always reversed whenever you want a nested data structure, I'd be fine with it, but conditions or non-nesting loops make it worse:

# for a list of files produce a mapping of {filename:(set of all words)}
# only in text files.
file_to_words_map = {}
for filename in list_of_files:
    if filename.endswith(".txt"):
        word_set = set()
        for line in open(filename):
            for word in line.split():
                word_set.add(word)
        file_to_words_map[filename] = word_set
        

### or using comprehension we get this lovely mess:

file_to_words_map = {
    filename: { word
            for line in open(filename)
               for word in line.split()
        }
    for filename in list_of_files
        if filename.endswith(".txt")
    }
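
One possible way to read the nested versions, sketched with made-up data: each pair of brackets is its own comprehension, and the inner one is a complete expression that gets evaluated once per item of the outer `for` clause. Pulling the inner comprehension out into a named helper makes that visible. The helper name `options_for`, the sample `iterable`, and the body of `valid_posibilities` are assumptions for illustration only.

```python
# Hypothetical stand-ins for illustration only.
iterable = ["ab", "cde", ""]

def valid_posibilities(a):
    """Assumed for illustration: the individual characters of a."""
    return list(a)

# Nested form: the inner brackets are evaluated once for each a
# produced by the outer for clause.
nested = {
    a: [b for b in valid_posibilities(a)]
    for a in iterable
}

# The same thing with the inner comprehension pulled out and named.
def options_for(a):
    return [b for b in valid_posibilities(a)]

named = {a: options_for(a) for a in iterable}

# Explicit loops for comparison.
looped = {}
for a in iterable:
    options = []
    for b in valid_posibilities(a):
        options.append(b)
    looped[a] = options

print(nested == named == looped)  # True
```

Read this way, the loops inside any single pair of brackets still run top to bottom like ordinary loops; only the value expression moves to the front.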

I teach Python to beginners. On the occasions when someone wants to generate a nested data structure with comprehensions and I tell them 'it isn't worth it', I'd like to be able to send them here for a nicer explanation of why.

So, for the people I will send here, I'm looking for one of the following:

  1. Is there another way to refactor these kinds of loops that makes the code easier to follow, instead of just directly sticking them in comprehensions?

  2. Is there a way to interpret and construct these nested loops in an intuitive way? At some point, someone who is not familiar with Python comprehensions will stumble across code like the examples shown here and will hopefully end up here looking for some insight.

Tadhg McDonald-Jensen
  • 1
    The thing that makes Python odd is that, when multiple `for` clauses are present at the same level in a comprehension, the outer loop appears first. This is contrary to how one would normally group them, with the first loop being part of an inner grouping. Fortran had implied DO loops, and they used the more logical ordering. Python reversed it, for convenience, but it makes it less consistent and more confusing. – Tom Karzes Nov 09 '20 at 00:08
  • Just think of it this way: When they occur at the same level, they're reversed in Python, with the outer loop appearing at the inner level, i.e. in the same order they would appear if they were broken out into separate loop statements. It's just one of the inconsistencies of the language that you have to live with. It doesn't reduce the expressive power. It just obfuscates it a bit. – Tom Karzes Nov 09 '20 at 00:08
  • I wonder why such incomprehensible comprehensions should be used in practice at all. You could show them to your students once so they know what is possible with nested comprehensions but not much more. – Michael Butscher Nov 09 '20 at 00:11
  • 1
    @MichaelButscher This usually comes up as someone is using a list comprehension but wants to make it produce a nested data structure and I inevitably tell them to just use the explicit loop and they ask why. It is always a struggle to balance hand waving or going into too much detail to give a satisfactory answer for why it's not worth it. My real goal is to have "look up this answer" as the explanation :) – Tadhg McDonald-Jensen Nov 09 '20 at 00:27
  • @TomKarzes sorry I wasn't able to follow, writing an answer with some code blocks might help ;) – Tadhg McDonald-Jensen Nov 09 '20 at 00:28
  • Your first nested loop can be written on one line as `for a in iterable: for b in valid_posibilities(a):...` which looks like the comprehension with the colons removed. More than one nesting and best to expand to a for loop. – wwii Nov 09 '20 at 00:37
  • @TadhgMcDonald-Jensen Logically, if you have `expr for i1 for i2`, it should group as `(expr for i1) for i2`, with the `i1` loop being the inner loop. That's how Fortran implied DO loops work. Python has instead reversed it. The `for` loops group together, with the leftmost `for` being the outer one. There is no logical grouping the way you'd like to think of it. Basically, the loop that is closest to the expression it applies to should be the inner loop. Python broke with that established convention for the dubious benefit of being able to scan them left-to-right. – Tom Karzes Nov 09 '20 at 00:38
  • @TomKarzes please post an answer, you have enough to say on this that it would help future readers to see what you have to say in an answer instead of buried in a long comment chain that gets hidden by default. Part of the power of StackOverflow is to let the voting system order answers based on how helpful they are but few people will find your help if it's in a comment. – Tadhg McDonald-Jensen Nov 09 '20 at 01:13
  • 1
    @TadhgMcDonald-Jensen - Question starts with "I am looking for an intuition that helps remember and easily parse out list comprehensions". Question ends by **dropping** that original ask, and saying instead "So my real question is: Is there another way to refactor these kinds of loops that make the code easier to follow instead of just directly sticking them in comprehensions?". Isn't that a big U-turn, and a big-contradiction ? Future readers will be confused about the real ask here. It also causes avoidable debates about relevance of posted answers. – fountainhead Nov 09 '20 at 02:09
  • 1
    @fountainhead thank you and fixed. I started by trying to mimic [this question in intent](https://stackoverflow.com/questions/37642573/how-can-i-make-sense-of-the-else-clause-of-python-loops/37643965#37643965) but realized after getting comments about "just don't do nested comprehensions because they are confusing" that that was the wrong direction so I rephrased it, missed the opening sentence. – Tadhg McDonald-Jensen Nov 09 '20 at 02:15
  • 1
    @TadhgMcDonald-Jensen - To me personally, the original ask, which is now removed, was the more interesting one (I actually started working on answering it!). I don't really believe that nested loops have some **inherent** un-readability that calls for some refactoring. AFAIK, the most popular motivation for people to use comprehensions is the potential brevity, and not because the nested loops have some un-readability – fountainhead Nov 09 '20 at 02:24
  • 1
    @fountainhead does the current version seem reasonable? I don't post many questions and asking in a way that is open but still focused and useful is really hard. – Tadhg McDonald-Jensen Nov 09 '20 at 02:33
  • IMHO, your code is already easy enough to follow. Could you be more specific about how exactly you want it to be refactored? – Georgy Nov 16 '20 at 10:50
  • @Georgy I was trying to mimic [this question](https://stackoverflow.com/q/37642573/5827215) in basic intent, my goal was to have a resource that I can point people to in the same way. I'm not totally sure what I am really looking for in an answer so I'm not sure how to phrase it in a way that seems reasonable. Do you think this seems less suitable than the `for-else` question? – Tadhg McDonald-Jensen Nov 17 '20 at 00:28
  • @TadhgMcDonald-Jensen Actually, I think that the `for-else` question is off-topic and should've been closed. A question asking how to remember semantics of a language construct can't have a definite answer IMO. I can't find a relevant discussion on meta about it, though. If you are willing, you could post a question there to find out what the community thinks. Maybe, I'm in the wrong here. – Georgy Nov 17 '20 at 11:38
  • @TadhgMcDonald-Jensen " _However this creates a single list._" I have also had the same problem. In my case, I wanted to iterate through a list of lists containing tokenised tweets, to remove the stopwords. It looks like this: `[['predicting', 'stock', 'performance', 'with', 'natural', 'language', 'deep', 'learning'], ['facebook', 'stock', 'drops', 'more', 'than', '20', 'after', 'revenue', 'forecast', 'misses']]` – Ayşe Nur May 22 '21 at 15:35
  • So I just needed to add the square brackets at the right place to make it work: `no_stopword_tweets = [[word for word in tweet if word not in stopwords.words("english")] for tweet in tokenised_tweets]` – Ayşe Nur May 22 '21 at 15:36
  • It actually makes sense, since this is a nested loop, we need to tell it the boundaries of our lists; otherwise, it becomes only one list. We preserve the list structure through adding square brackets around the list comprehension part we use to process the inner list items. – Ayşe Nur May 22 '21 at 15:44

2 Answers

5

Maybe the problem is that you are overusing list comprehensions. I love them too, but what purpose do they serve when the code becomes more convoluted than the loop?

If you want to stay with the list-comprehension-over-everything approach, you could factor the inner loops out into helper functions. This way it is much easier to digest:

def collect_words(file):
    ...

file_to_words_map = {
    filename: collect_words(open(filename))
    for filename in list_of_files if filename.endswith(".txt")
}

By the way, I don't think breaking such statements across multiple lines necessarily makes them clearer (instead, your urge to do so is quite telling). In the above example I intentionally rejoined the `for` and `if` parts.
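
The body of the helper is left open in this answer. As one possible filling, here is a minimal sketch that assumes `collect_words` should return the set of whitespace-separated words in an already opened file. That behaviour is an assumption based on the question's `file_to_words_map` example, not something the answer specifies, and `io.StringIO` stands in for a real file so the sketch runs without touching disk.

```python
import io

def collect_words(file):
    """Hypothetical body: the set of whitespace-separated words in an open file."""
    return {word for line in file for word in line.split()}

# A made-up in-memory "file" for demonstration.
fake_file = io.StringIO("the quick fox\nthe lazy dog\n")
words = collect_words(fake_file)

print(words == {"the", "quick", "fox", "lazy", "dog"})  # True
```

Note that passing `open(filename)` straight into the helper leaves closing the file to the garbage collector; a variant that takes the filename and uses a `with` block inside the helper would close it deterministically.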

ypnos
  • The indentation was more for the purpose of letting readers of the question track the order of the statements and where they go when rewritten as a comprehension. What do you think it is telling of, that it is complicated enough that it should be refactored? – Tadhg McDonald-Jensen Nov 09 '20 at 00:58
  • 3
    +1000 List comprehensions are intended to be a succinct way to create lists. Not to be another kind of loop. – Keith Nov 09 '20 at 01:07
  • @TadhgMcDonald-Jensen Yes, or to put it in other words, that you are using a tool intended to make certain statements simpler and more straightforward for something that will not get simpler or more straightforward by using this tool. Sometimes, the boring, old-fashioned way of writing it down is still the most elegant one. I understand your intention is to use these examples for teaching concepts, but I am not sure how much you can convey through a statement that by itself is just plain confusing to read, even to a seasoned developer. – ypnos Nov 11 '20 at 10:06
  • Btw. I do believe it is a fun exercise and found some entertainment in going through your `file_to_words_map` code. Rather than trying to simplify it I would probably use it for a "What is the output of this piece of code?", or even, "How to write this down in a more readable fashion? (hint: with loops)" homework. Definitely a good starter for discussions on code style. – ypnos Nov 11 '20 at 10:11
  • [this is how this typically comes up for me](https://stackoverflow.com/questions/64744298/list-comprehension-loop-ordering-depends-on-nesting/64744344?noredirect=1#comment114474360_64744298) idk if either of those would work, as a "what is the output?" question I feel like people will either get it correct or be confused about order of variable initialization (since the place where the variables get defined is already confusing enough with comprehensions). And I agree with fountainhead here, I don't think the nested comprehensions are too unreadable you just have to be expecting it. – Tadhg McDonald-Jensen Nov 11 '20 at 13:32
1

One approach is to use generators! Because you end up writing your code on a statement basis instead of an expression basis, it ends up being much easier to extend.

def words_of_file(filename):
    """opens the file specified and generates all words present."""
    with open(filename) as file:
        for line in file:
            for word in line.split():
                yield word
                
def get_words_in_files(list_of_files):
    """generates tuples of form (filename, set of words) for all text files in the given list"""
    for filename in list_of_files:
        # skip non text files
        if not filename.endswith(".txt"):
            continue # can use continue to skip instead of nesting everything

        # named word_set so the local variable does not shadow the words_of_file function
        word_set = set(words_of_file(filename))
        # dict constructor takes (key,value) tuples.
        yield (filename, word_set)

file_to_words_map = dict(get_words_in_files(["a.txt", "b.txt", "image.png"]))

Using generators has a number of benefits:

  • We can use statements like `with` and `continue`, variable assignment, and debugging print statements, all because we are in a block of statements instead of inside a single expression.
  • `words_of_file` just generates the words; it doesn't dictate that they must be put into a set. Some other code may choose to iterate over the words directly or pass them to the `list` constructor instead. Maybe a `collections.Counter` would be useful too! The point is that the generator lets the caller decide how to use the sequence.
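
To illustrate the second point, here is a small sketch in which the same generator feeds three different consumers. The name `words_of` and the sample text are made up; the generator takes any iterable of lines (rather than a filename) so the sketch runs without real files.

```python
import collections
import io

def words_of(lines):
    """Generate every whitespace-separated word from an iterable of lines."""
    for line in lines:
        for word in line.split():
            yield word

text = "a b a\nc a b\n"  # made-up sample input

# The same generator function, consumed three different ways by the caller.
unique_words = set(words_of(io.StringIO(text)))
all_words = list(words_of(io.StringIO(text)))
counts = collections.Counter(words_of(io.StringIO(text)))

print(unique_words == {"a", "b", "c"})  # True
print(all_words)                        # ['a', 'b', 'a', 'c', 'a', 'b']
print(counts["a"])                      # 3
```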

This doesn't stop you from using comprehensions or other shortcuts either. If you want to yield all the elements of an iterator, you can just `yield from` that iterator, so you might end up with code like this:

def words_of_file(filename):
    """opens the file specified and generates all words present."""
    with open(filename) as file:
        for line in file:
            # produces all words of the line
            yield from line.split()
            
file_to_words_map = {filename:set(words_of_file(filename))
                      for filename in list_of_files
                         if filename.endswith(".txt")
                     }

Different people have different opinions. My favourite is the generator-only option because I am a very big fan of generators. I'm sure some people like the one-liner solution of the nested comprehension, but this last version, which uses a simple comprehension and helper functions, is probably what most people would be most comfortable with.

Tadhg McDonald-Jensen
  • A very nice angle. Though, while `words_of_file` does look great, it misses the efficiency of storing directly into the set, which can be quite significant when you have many repetitions in the input. – ypnos Nov 11 '20 at 10:14
  • 1
    in contrast, `collections.Counter(words_of_file(name))` is slightly faster with the `yield from` version compared to `collections.Counter(word for line in file for word in line.split())` so unless you will always use a `set`, `list` or `dict` which have built in comprehensions, using a generator function instead of a generator expression doesn't incur any noticeable performance loss. – Tadhg McDonald-Jensen Nov 11 '20 at 13:44
  • Yes, there you use a generator to avoid an unnecessary intermediate container. Thanks for the great advice! – ypnos Nov 11 '20 at 14:07