1

In a certain data set, I have many cases where a value can be either a list or a single value of the same type (for context, the data comes from an ElasticSearch DB). For instance (not valid JSON, just to illustrate the idea):

var_of_data_type_x = {
   item_a: { data_structure_a }
}

or 

var_of_data_type_x = { 
   item_a: [
      { data_structure_a },
      { data_structure_a },
      { data_structure_a }
   ]
}

To make matters worse, the fields of data_structure_a can vary in the same way, down to the scalar/list-of-scalars level, possibly nested 2-3 levels deep.

So all my processing code needs to check whether an item is a list or a single value, and unwrap the list if necessary, in the style shown below. This means a lot of code duplication unless I create many tiny functions (each piece of processing code is around 5-10 lines in most cases). Even if I moved common code to functions, the pattern shown below gets repeated, sometimes nested 2-3 levels deep.

# list-checking-code

if isinstance(var, list):
   for x in var:
      # item wise processing code for (x) ...
else:
   # exactly same code as above for (var)

I know this is a nightmare design, and I'd rather the data structures were consistent, but this is my input. I could write some simple preprocessing to make it consistent by wrapping every singular instance in a list. That would create a lot of single-element lists though, as in many cases the values are singular.

What would be the best approach for tackling this? So far, all approaches I see have their own problems:

  1. creating double code (as above) for list vs singular cases: probably the most efficient, but readability hell as this happens a lot, especially nested! This is my preferred method for efficiency reasons although it's a code/maintain nightmare.
  2. preprocess data and wrap each singular item in a list: not sure how efficient creating a lot of single-element lists is. Also, most such items in data will be accessed only once.
  3. write a lot of functions for item-level processing, which will save some code complexity, but add a lot of 5-10 line functions.
  4. do (3) above, and additionally move the #list-checking-code pattern above to another function, which will take the function from (3) as an argument.
  5. write functions to accept var-args, and pass all arguments as unwrapped lists. This will eliminate the isinstance() check and if-then-else, but I'm not sure if unwrapping has its own overhead. (The lists in question typically have very few elements.)
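
For illustration, option (4) might look roughly like the sketch below. The names `apply_to_each` and `describe`, and the sample dictionaries, are made up for this example and are not part of the real data; it also assumes that only `list` (not tuples or other sequences) marks the multi-valued case.

```python
from typing import Any, Callable, List


def apply_to_each(value: Any, func: Callable[[Any], Any]) -> List[Any]:
    """Call func on every element if value is a list, otherwise on value itself."""
    if isinstance(value, list):
        return [func(item) for item in value]
    return [func(value)]


def describe(item: dict) -> str:
    # Hypothetical item-level processing, standing in for a 5-10 line body.
    return f"{item['name']}:{item['size']}"


# Hypothetical records: the same field holds a dict in one and a list in the other.
single = {"item_a": {"name": "x", "size": 1}}
many = {"item_a": [{"name": "x", "size": 1}, {"name": "y", "size": 2}]}

print(apply_to_each(single["item_a"], describe))  # ['x:1']
print(apply_to_each(many["item_a"], describe))    # ['x:1', 'y:2']
```

The list check then lives in one place, at the cost of one small named function per processing step.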

What would be the best approach here, or is there a better, more Pythonic way? Performance and efficiency are concerns.

Greenleaf
  • 13
  • 2
  • 3
    `for x in ensure_list(foo):`, where that’s a simple helper function like `return foo if isinstance(foo, list) else [foo]`…? – deceze Jan 21 '21 at 20:52
  • 2
    I would not start with concerns about efficiency - this is premature optimization. Start by coming up with the interfaces and interactions that make the most sense, communicate your intent most effectively, etc, and then build those. If you've defined them correctly, making them efficient will be something you can do when performance tells you it's time to do so – Jon Kiparsky Jan 21 '21 at 20:52
  • @JonKiparsky I agree with this. I have reasonable Java experience, but I'm very new to Python and was wondering if there's a natural Pythonic way of looking at this problem that I don't see. – Greenleaf Jan 22 '21 at 07:54
  • @JonKiparsky For instance, if there was a syntactic way to just treat a singleton variable as a list (so that the list unwrap * operator would work on it without any fuss), it would have made my life very easy. – Greenleaf Jan 22 '21 at 08:04

2 Answers

1

I would like to be able to assume that your access to Elasticsearch is mediated by some code that allows the rest of your code to not know or care that Elasticsearch is involved. If that were the case, then the problem would be pretty simple: that code should always return data as a list.

However, since you're asking the question, I suspect that this is not the case, and you have lots of code that knows about Elasticsearch, and talks to it. If that is the case, then a utility function is probably the simplest solution here. Something like:

def oughta_be_a_list(input):
    if isinstance(input, list):
        return input
    else:
        return [input]

(the names should be changed to ones that suit your local naming conventions, of course)

You would then use that every time you access your data source. Messy, but this is one of the reasons why we like to isolate that sort of code!
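
As a rough usage sketch, here is how such a helper could flatten the nested list-or-scalar checks into plain loops. The helper is repeated so the snippet runs on its own, and the `record` layout with its `parts` and `ids` keys is invented for illustration, not taken from any real Elasticsearch mapping.

```python
def oughta_be_a_list(value):
    """Return value unchanged if it is already a list, else wrap it in one."""
    return value if isinstance(value, list) else [value]


# Hypothetical record: "item_a" is a single dict, "parts" is a list,
# and "ids" is a scalar in one part and a list in the other.
record = {"item_a": {"parts": [{"ids": 1}, {"ids": [2, 3]}]}}

ids = []
for item in oughta_be_a_list(record["item_a"]):
    for part in oughta_be_a_list(item["parts"]):
        for part_id in oughta_be_a_list(part["ids"]):
            ids.append(part_id)

print(ids)  # [1, 2, 3]
```

Each level needs only one loop, whichever shape the data arrives in.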

Jon Kiparsky
  • 7,499
  • 2
  • 23
  • 38
  • You have wrapped the arguments in `isinstance()`. Would it be more effective to replace `return [input]` with `yield input`? – Jolbas Jan 21 '21 at 21:20
  • I'm not sure what "more effective" means for you. If I understand you correctly, we'd `yield` just in the case where we get a non-list as our input. That means this function would behave sometimes like a function and sometimes like a generator, which sounds unpleasant and confusing to me. But perhaps I've misunderstood your meaning. – Jon Kiparsky Jan 21 '21 at 21:49
  • I meant that it would be a shortcut to just yield the input instead of creating a list which is then turned into an iterator in the `for x in oughta_be_a_list(input)`. I agree it may be slightly confusing behaviour, which amounts to being less effective in another sense. – Jolbas Jan 21 '21 at 22:01
  • @JonKiparsky The `isinstance` args seem to be in wrong order. – VPfB Jan 22 '21 at 06:58
  • @JonKiparsky yes, ES is mediated. I can do what you suggest, this is in fact the solution (2) that I mentioned above. However that involves creating a lot of new lists to wrap single-element data instances. (majority of them are single-element data). Does this have a performance impact in python? – Greenleaf Jan 22 '21 at 07:44
  • @VPfB Thanks, fixed. I'll excuse my mistake by saying that I prefer to avoid writing code that requires resorting to `isinstance`, which is why I'm not so familiar with the args structure of that function :) – Jon Kiparsky Jan 22 '21 at 13:37
  • @Greenleaf I think it's extremely unlikely that this would cause a significant performance impact, but that's something you could measure. The first question, of course, is "do I have a performance problem at all?" If not, you can take the time you're spending thinking about performance and put it into making new features. If so, you can do some profiling and determine where your slowness is happening. – Jon Kiparsky Jan 22 '21 at 13:45
  • @JonKiparsky I guess I'd just measure the performance someday if the code feels slow; just wanted to know if there was any python-specific way to handle this better. – Greenleaf Jan 23 '21 at 17:03
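
One way to measure the cost discussed in these comments is a quick `timeit` comparison of the duplicated-branch style against the wrap-in-a-list style. This is only a sketch: the function names and the summing workload are placeholders, and the timings depend on the machine and Python version, so no figures are given here.

```python
import timeit


def ensure_list(value):
    return value if isinstance(value, list) else [value]


def process_branching(value):
    # Duplicated-code style: separate branches for list and scalar.
    total = 0
    if isinstance(value, list):
        for x in value:
            total += x
    else:
        total += value
    return total


def process_wrapped(value):
    # Wrapping style: scalars become single-element lists first.
    total = 0
    for x in ensure_list(value):
        total += x
    return total


def measure(func, value, number=100_000):
    """Return the seconds taken by `number` calls of func(value)."""
    return timeit.timeit(lambda: func(value), number=number)


for label, value in [("scalar", 5), ("list", [5, 6, 7])]:
    print(label, measure(process_branching, value), measure(process_wrapped, value))
```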
0

If I understand you right, you will handle all leaf nodes the same way no matter how deep in the tree they are. Then maybe use some kind of recursive function that yields every object that is not a list. Copied from this answer.

def deep_iter(var):
    if isinstance(var, list):
        for a in var:
            yield from deep_iter(a)
    else:
        yield var
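
A short usage sketch follows, with the generator repeated so the snippet runs on its own (the `else` branch yields `var`, the function's own argument). The sample values are made up. Note that it flattens nested lists only: a dict is yielded whole, so list-or-scalar fields inside it would need their own `deep_iter` call.

```python
def deep_iter(var):
    """Yield every non-list object found in var, descending into nested lists."""
    if isinstance(var, list):
        for a in var:
            yield from deep_iter(a)
    else:
        yield var


print(list(deep_iter(5)))                     # [5]
print(list(deep_iter([1, [2, [3, 4]], 5])))   # [1, 2, 3, 4, 5]
print(list(deep_iter([{"k": [1, 2]}])))       # [{'k': [1, 2]}]
```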
Jolbas
  • 757
  • 5
  • 15