Python and Pandas - Populate a Pandas DataFrame column with a list of numbers

Question

Here is the data I have in a Pandas DataFrame:

ID | Min | Max
--------------
1  |  1  | 10
2  |  54 | 105
3  |  24 | 0
.  |  .  | .
.  |  .  | .
.  |  .  | .
N  |  X  | Y

Here is the output DataFrame I'm trying to get:

ID | Min | Max | All Numbers in Range
---------------------------------------
1  |  1  | 10  | [1,2,3,4,5,6,7,8,9,10]
2  |  54 | 105 | [54,55,56,...,104,105]
3  |  24 | 0   | [1,2,3,...,22,23,24]
.  |  .  | .   |           .
.  |  .  | .   |           .
.  |  .  | .   |           .
N  |  X  | Y   | [X, ...............,Y]

I can do this with a loop that generates the lists (or NumPy arrays) row by row, but it's very slow: it will take two hours to complete with the amount of data I have. I can also do this with apply, but it's no faster than the loop, and I can't figure out how to vectorize the operation so it runs faster.

Here is one of the ways I've tried to vectorize it that didn't work:

def create_list(min, max):
    if max != 0:
        num_list= np.arange(min, max + 1, 1)
    else:
        num_list= np.arange(1, min + 1, 1)

    return num_list


df["num_list"] = create_list(df["min"], df["max"])

Which gives me the error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any help would be appreciated.

Edit: My current solution before posting (no faster than a loop using iterrows):

def create_list(min, max):
    if max != 0:
        num_list= np.arange(min, max + 1, 1)
    else:
        num_list= np.arange(1, min + 1, 1)

    return num_list


df["num_list"] = df.apply(lambda row: create_list(row["min"], row["max"]), axis = 1)

Does this answer your question? [Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) — AMC, Apr 17 '20 at 22:39
Hi AMC, unfortunately that other post doesn't solve my problem. As for the size of the data, it's tens of millions of records. – greatbigfoot, Apr 17 '20 at 22:51
If it matters, I also get a similar ValueError when I remove the conditions and call np.arange on the columns directly: `np.arange(df["min"], df["max"], 1)` and `np.arange(df["min"].values, df["max"].values, 1)`. Both attempts give ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() – greatbigfoot, Apr 17 '20 at 22:57

score 1 · Answer 1 · answered Apr 17 '20 at 21:10

You can do:

def create_list(row):
    if row['max'] != 0:
        num_list= np.arange(row['min'], row['max'] + 1, 1)
    else:
        num_list= np.arange(1, row['min'] + 1, 1)

    return num_list


df["num_list"] = df.apply(create_list, axis=1)
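
Since the question reports that row-wise apply is as slow as a loop, here is a minimal sketch of one common alternative: iterating over the two columns with zip in a list comprehension, which avoids building a Series object for every row. The sample frame and the lowercase column names are assumptions taken from the question's code, and whether this is fast enough for tens of millions of rows would need to be measured.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; lowercase column names are an assumption.
df = pd.DataFrame({"min": [1, 54, 24], "max": [10, 105, 0]})


def create_list(lo, hi):
    """Return lo..hi inclusive, or 1..lo when hi is 0 (the question's rule)."""
    if hi != 0:
        return np.arange(lo, hi + 1)
    return np.arange(1, lo + 1)


# Walk the two columns in parallel instead of materializing each row.
df["num_list"] = [create_list(lo, hi) for lo, hi in zip(df["min"], df["max"])]
```

The per-row logic is unchanged; only the iteration mechanism differs from apply with axis=1.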

Quang Hoang

CypherX · Answer 2 · 2020-04-17T23:13:09.933

Solution

# Inside the parentheses no backslash continuations are needed.
df['All Numbers in Range'] = df.apply(
    lambda row: range(row['Min'], row['Max'] + 1)
                if row['Max'] != 0
                else range(1, row['Min'] + 1),
    axis=1)

edited Apr 17 '20 at 23:13

answered Apr 17 '20 at 21:11


This doesn't reflect the `if max!=0` condition :-) – Quang Hoang Apr 17 '20 at 21:13
@QuangHoang Thank you for pointing it out. Updated the solution for **`max!=0`**. – CypherX Apr 17 '20 at 23:14

score 0 · Answer 3 · answered Apr 17 '20 at 23:32

You might consider first standardising your data, then performing a simpler operation on all rows.

# use boolean indexing to identify and fix your special cases (max == 0)
maxIsZero = df['max'] == 0
df.loc[maxIsZero, 'max'] = 1
df.loc[maxIsZero, ['min', 'max']] = df.loc[maxIsZero, ['max', 'min']].to_numpy()


# now your min & max columns are semantically consistent:
#    min  max
# 0    1   10
# 1   54  105
# 2    1   24

# create a new range column using a helper function
def generate_range(r):
    return np.arange(r['min'], r['max']+1, 1)

df['num_list'] = df.apply(generate_range, axis=1)
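
Building on the same standardising idea, the following is a rough sketch of how the ranges themselves could be produced without a per-row function: compute the bounds with np.where, build one flat array holding every value, then cut it at the row boundaries with np.split. It assumes integer columns named min and max and that every row has start <= stop after standardising; the names start, stop, begins, ends and flat are illustrative only. It also holds all values in memory at once, so it may not suit very large inputs, and any speed-up should be measured rather than assumed.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; lowercase column names are an assumption.
df = pd.DataFrame({"min": [1, 54, 24], "max": [10, 105, 0]})

lo = df["min"].to_numpy()
hi = df["max"].to_numpy()
max_is_zero = hi == 0

# Vectorized bounds: rows with max == 0 become the range 1..min.
start = np.where(max_is_zero, 1, lo)
stop = np.where(max_is_zero, lo, hi)

# Length of each row's range and its position inside the flat array.
lengths = stop - start + 1
ends = np.cumsum(lengths)
begins = ends - lengths

# Flat array of all values: position within the row plus that row's start.
flat = np.arange(ends[-1]) - np.repeat(begins, lengths) + np.repeat(start, lengths)

# Cut the flat array back into one array per row.
df["num_list"] = np.split(flat, ends[:-1])
```

Unlike the boolean-indexing version, this leaves the original min and max columns untouched.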
