Efficiently check if an array is jagged

Question

I'm looking for an efficient way to check if an array is jagged, where "jagged" means that an element of the array has a different shape from one of it's neighbors in the same dimension.

e.g. [[1, 2], [3, 4, 5]] or [[1, 2], [3, 4], [5, 6], [[7], [8]]]

Where I'm using list syntax for convenience, but the arguments may be nested lists or nested numpy arrays. I'm also showing integers for convenience, by the lowest-level components could be anything (e.g. generic objects). Let's say the lowest-level objects are not iterable themselves (e.g. str or dict, but definitely bonus points for a solution that can handle those too!).

Attemp:

Recursively flattening an array is pretty easy, though I'm guessing quite inefficient, and then the length of the flattened array can be compared to the numpy.size of the input array. If they match, then it is not jagged.

def really1d(arr):
    # Returns false if the given array is not 1D or is a jagged 1D array.
    if np.ndim(arr) != 1:
        return False
    if len(arr) == 0:
        return True
    if np.any(np.vectorize(np.ndim)(arr)):
        return False
    return True


def flatten(arr):
    # Convert the given array to 1D (even if jagged)
    if (not np.iterable(arr)) or really1d(arr):
        return arr
    return np.concatenate([flatten(aa) for aa in arr])


def isjagged(arr):
    if (np.size(arr) == len(flatten(arr))):
        return False
    return True

I'm pretty sure the concatenations are copying all of the data, which is a complete waste. Maybe there is an itertools or numpy.flatiter method of achieving the same goal? Ultimately the flattened array is only being used to find it's length.

This answer is in java, but it may help. https://stackoverflow.com/a/22874074/13314450 — CircuitSacul, May 24 '20 at 17:55
Thanks @LD, I did see a few of these answers for other languages, but I suspect that finding an efficient answer requires utilizing appropriate `numpy` or `itertools` methods that do this with unnecessarily copying data, which I think I'm doing in the attempt here. — DilithiumMatrix, May 24 '20 at 17:59
What you show are lists, but the tests all use `numpy`. A 'jagged' numpy array will have `object` dtype. Usually the shape is 1d as well (or at least fewer dimensions than expected). — hpaulj, May 24 '20 at 18:05
@hpaulj lists were shown for simplicity, question amended. If I generically knew the "expected" shape, the solution would be trivial. — DilithiumMatrix, May 24 '20 at 18:15
You seem to be trying to recreate `np.array's` parsing of the lists. Evidently `np.array` works its way down the nesting. The ideal is a multidimensional array of numeric (or string) dtype. But failing that, it falls back on some sort of object dtype (or raises an error). The details are in compiled code, which I've never tried to study. — hpaulj, May 24 '20 at 18:37
Given how general your problem is, I'm not sure that efficiency can be measured. You either have a lists of lists or an object dtype array (a numeric dtype can't be ragged. Iteration on object arrays is slower than iteration on a list. And the fast compiled numpy methods don't work in object arrays. — hpaulj, May 25 '20 at 01:04
@hpaulj interesting, that's very helpful - I didn't realized the numpy behaviors were so different for object arrays (though it now seems obvious that it *must* be so). Thanks! — DilithiumMatrix, May 25 '20 at 17:12

Eddie Bergman · Answer 1 · 2022-02-19T21:03:17.367

Perhaps not the most efficient but it works nicely in numpy. and will short circuit as soon as one of the conditions is False. If the first three conditions are True, we have no choice but to iterate through the rows.

Thankfull, all will shortcircuit as soon as one of the iterations is False so it won't check all the rows if it doesn't have to.

def jagged(x):
    x = np.asarray(x)
    return (
        x.dtype == "object"
        and x.ndim == 1
        and isinstance(x[0], list)
        and not all(len(row) == len(x[0]) for row in x) 
    )

If you wan't to squeeze more efficieny out, it's actually performing the len(x[0]) every iteration of the all part but this is probably inconsequential and this is a lot more legible than the alaternative which would have you write out the whole if statement.

hpaulj · Answer 2 · 2020-05-24T18:27:11.230

First what you show are lists, not arrays (but more on that later):

In [305]: alist1 = [[1, 2], [3, 4, 5]]                                                   
In [306]: alist2 = [[1, 2], [3, 4], [5, 6], [[7], [8]]]

Mixed len at the first level is a simple and obvious test

In [307]: [len(i) for i in alist1]                                                       
Out[307]: [2, 3]

but it's not enough with the 2nd example:

In [308]: [len(i) for i in alist2]                                                       
Out[308]: [2, 2, 2, 2]

Making an array from list1 produces a 1d object dtype:

In [310]: np.array(alist1)                                                               
Out[310]: array([list([1, 2]), list([3, 4, 5])], dtype=object)

list2 is 2d, but still object dtype:

In [311]: np.array(alist2)                                                               
Out[311]: 
array([[1, 2],
       [3, 4],
       [5, 6],
       [list([7]), list([8])]], dtype=object)

np.array is not the most efficient tool; while compiled, it does have to evaluate the nest list at least down to the level where it finds the discrepency.

If the list isn't ragged, at any level, the result is a numeric dtype:

In [321]: alist3 = [[1, 2], [3, 4], [5, 6], [7, 8]]                                      
In [322]: np.array(alist3)                                                               
Out[322]: 
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

If the list elements are arrays, there can be a further result - a broadcasting error. This is results when the first dimensions match, but the differences are in the lower level(s).

In sum, if it is already a numpy array, then object is a good indicator, especially if you were expecting a numeric dtype. If the lowest level elements might themselves be objects (other than lists) this won't help. In both the list1 and list2 cases, some or all of the lowest level elements are objects - lists.

If it's a list of lists, then recursive evaluation of the len is probably the way to go. But only time tests can prove that this is better than np.array(alist).

Could you tell me how this solution is any different than mine? — Anwarvic, May 24 '20 at 18:26

score 0 · Answer 3 · answered May 24 '20 at 19:21

Apologies if I phrased the question too vaguely, but I need a solution much more general than only for the given (list of integer list) examples.

I'm still guessing there might be a better solution, but here is a significant improvement that definitely does not duplicate the input in memory:

def really1d(arr):
    if np.ndim(arr) != 1:
        return False
    if len(arr) == 0:
        return True
    if np.any(np.vectorize(np.ndim)(arr)):
        return False
    return True


def flatlen(arr):
    # NOTE: If you know your base types are NOT iterable (e.g. not `str`, or `dict`, etc)
    # Then you might be able to get away with:
    # if not np.iterable(arr):

    # This will work for my cases (catching possible `str` and `dict` types)
    if np.isscalar(arr) or isinstance(arr, dict):
        return 1

    if really1d(arr):
        return len(arr)

    return np.sum([flatlen(aa) for aa in arr])


def isjagged(arr):
    if np.isscalar(arr) or (np.size(arr) == flatlen(arr)):
        return False
    return True

If `arr` is a list, `np.ndim` and `np.vectorize` (and `np.size`) will each convert it to an array before doing their thing. That makes a copy (though in the case of an object array that will just a copy of the list references). I'd stay away from `numpy` functions if I wanted to efficiently check a list of lists - or do one conversion at the start. — hpaulj, May 25 '20 at 04:29

score 0 · Answer 4 · answered May 24 '20 at 21:17

Here's a different way to approach to the problem. It aims for a bit more generality (no numpy assumptions) and code simplicity. It ignores the efficiency issue that you raised a few times: it does not flatten or copy the data, but it does contruct a parallel data structure to make the test for jaggedness easy.

def simplified(xs):
    # Takes a value and returns it in recursively simplfied form.
    # Array-like values (list, tuple, str) become tuples.
    # All other values (and single characters) become None.
    if isinstance(xs, (list, tuple)):
        return tuple(simplified(x) for x in xs)
    elif isinstance(xs, str):
        return tuple(None for x in xs)
    else:
        return None

def is_jagged(xs):
    # Takes a simplified value.
    # Non-jagged structures will have the same form at the top level.
    return len(set(xs)) > 1

Demo:

tests = (
    # Non-jagged.
    (False, []),
    (False, [[], [], []]),
    (False, [1, 2, 3]),
    (False, [[1, 2], [3, 4]]),
    (False, [[1, 2], [3, 4], [5, 6], [7, 8]]),
    (False, ('ab', 'cd')),
    (False, (['ab', 'cd', 'ef'], ('gh', 'ij', 'kl'))),
    # Jagged.
    (True, [1, 2, [3, 4]]),
    (True, [[1, 2], [3, 4, 5]]),
    (True, [[1, 2], [3, 4], [5, 6], [[7], [8]]]),
    (True, ('ab', 'cdefg')),
)
fmt = '\nInput:      {}\nSimplified: {}\nIs jagged:  {} [{}]'
for exp, xs in tests:
    sim = simplified(xs)
    isj = is_jagged(sim)
    msg = fmt.format(xs, sim, isj, 'ok' if isj == exp else 'DOH')
    print(msg)

Output:

Input:      []
Simplified: ()
Is jagged:  False [ok]

Input:      [[], [], []]
Simplified: ((), (), ())
Is jagged:  False [ok]

Input:      [1, 2, 3]
Simplified: (None, None, None)
Is jagged:  False [ok]

Input:      [[1, 2], [3, 4]]
Simplified: ((None, None), (None, None))
Is jagged:  False [ok]

Input:      [[1, 2], [3, 4], [5, 6], [7, 8]]
Simplified: ((None, None), (None, None), (None, None), (None, None))
Is jagged:  False [ok]

Input:      ('ab', 'cd')
Simplified: ((None, None), (None, None))
Is jagged:  False [ok]

Input:      (['ab', 'cd', 'ef'], ('gh', 'ij', 'kl'))
Simplified: (((None, None), (None, None), (None, None)), ((None, None), (None, None), (None, None)))
Is jagged:  False [ok]

Input:      [1, 2, [3, 4]]
Simplified: (None, None, (None, None))
Is jagged:  True [ok]

Input:      [[1, 2], [3, 4, 5]]
Simplified: ((None, None), (None, None, None))
Is jagged:  True [ok]

Input:      [[1, 2], [3, 4], [5, 6], [[7], [8]]]
Simplified: ((None, None), (None, None), (None, None), ((None,), (None,)))
Is jagged:  True [ok]

Input:      ('ab', 'cdefg')
Simplified: ((None, None), (None, None, None, None, None))
Is jagged:  True [ok]

Thanks for the answer. As I provided a *working* solution with the question, the *efficiency* part is pretty important... thus an answer that "ignores the efficiency issue" is't really an answer. — DilithiumMatrix, May 25 '20 at 00:33
Perhaps, but StackOverflow is many things: imagine future users who end up looking at this question with slightly different needs than yours. Speaking of efficiency, if you want to good answers on that front, we need more information: what aspects of the problem should be prioritized over others; what kind(s) of efficiency matter most; can the problem be parallelized; what data types must be supported; and what benchmarks do you have (how fast/efficient/whatever is your current approach and what target would you like to hit). Without such details, talk of efficiency is often empty. — FMc, May 25 '20 at 01:23

Anwarvic · Answer 5 · 2020-05-24T18:19:37.690

-2

There is a super simple way that can be done in a line.. just one line. It's not perfect, but it's super simple.

You can do that using np.array. If all elements of the nested-list have the same shape, the dimensions of the resulting array will be more than one dimension. But if just one element has a different number of elements, the returned array will be of one dimension.

See this example for more understanding:

>>> import numpy as np
>>> lst = [[1, 2], [3, 4, 5]]
>>> arr = np.array(lst)
>> arr.shape
(2,)
>>> arr.dtype
object

>>> lst = [[0, 1, 2], [3, 4, 5]]
>>> arr = np.array(lst)
>>> arr.shape
(2, 3)
>>> arr.dtype
int64

So, your function will be written in just one line like so:

def isjagged(lst):
    return len(np.array(lst).shape) == 1

NOTE: Of course this will only work on nested-lists

EDIT

As said by @Ch3steR, this solution will work with simple nested-list and it won't work for a little bit complicated ones like this one: [[1, 2], [3, 4], [5, 6], [[7], [8]]].

So, I think this could be a nicer solution:

def isjagged(lst):
    return np.array(lst).dtype == 'object'

edited May 24 '20 at 18:19

answered May 24 '20 at 18:02

Anwarvic

12,156
4
49
69

1

No this will not work, try with this `np.array([[1, 2], [3, 4], [5, 6], [[7], [8]]])`, shape is `(4,2)` and it's *jagged*. – Ch3steR May 24 '20 at 18:09
yeah... you're right! I could use the `dtype` instead so it becomes: `return np.array(lst).dtype == 'object'` What do you think? – Anwarvic May 24 '20 at 18:10
I'm sorry if my examples were poorly chosen. The solution I need in particular, *must* be able to work on a simple case like, `np.empty((2, 2), dtype=object)` which I would certainly call *not*-jagged. – DilithiumMatrix May 24 '20 at 19:23
But what's the purpose of checking for jaggedness? Just a programming exercise, or because the next operation needs that information? Many array and tensor operations will choke on a jagged list. – hpaulj May 25 '20 at 03:41

Efficiently check if an array is jagged

5 Answers5

EDIT