-2

Here is the data I have in a Pandas DataFrame:

ID | Min | Max
--------------
1  |  1  | 10
2  |  54 | 105
3  |  24 | 0
.  |  .  | .
.  |  .  | .
.  |  .  | .
N  |  X  | Y

Here is the output DataFrame I'm trying to get:

ID | Min | Max | All Numbers in Range
---------------------------------------
1  |  1  | 10  | [1,2,3,4,5,6,7,8,9,10]
2  |  54 | 105 | [54,55,56,...,104,105]
3  |  24 | 0   | [1,2,3,...,22,23,24]
.  |  .  | .   |           .
.  |  .  | .   |           .
.  |  .  | .   |           .
N  |  X  | Y   | [X, ...............,Y]

I can do this with a loop and generate the lists (or NumPy arrays) row by row, but it's very slow: with the amount of data I have, it will take two hours to complete. I can also do this with apply, but it's no faster than the loop. I can't figure out how to vectorize the operation so that it runs faster.

Here is one of the ways I've tried to vectorize it that didn't work:

def create_list(min, max):
    if max != 0:
        num_list= np.arange(min, max + 1, 1)
    else:
        num_list= np.arange(1, min + 1, 1)

    return num_list


df["num_list"] = create_list(df["min"], df["max"])

Which gives me the error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any help would be appreciated.

Edit: My current solution before posting (no faster than a loop using iterrows):

def create_list(min, max):
    if max != 0:
        num_list= np.arange(min, max + 1, 1)
    else:
        num_list= np.arange(1, min + 1, 1)

    return num_list


df["num_list"] = df.apply(lambda row: create_list(row["min"], row["max"]), axis = 1)
AMC
  • Does this answer your question? [Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) – AMC Apr 17 '20 at 22:39
  • How big is the dataframe? What are the ranges for? – AMC Apr 17 '20 at 22:39
  • Hi AMC, unfortunately that other post doesn't solve my problem. As for the size of the data, it's tens of millions of records. – greatbigfoot Apr 17 '20 at 22:51
  • If it matters, I also get a similar ValueError when I remove the conditions and just call np.arange on the DataFrame columns, with either np.arange(df["min"], df["max"], 1) or np.arange(df["min"].values, df["max"].values, 1). Both attempts give ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() – greatbigfoot Apr 17 '20 at 22:57
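Some context on the error discussed in these comments: `if max != 0` asks for a single True/False from a whole Series, and `np.arange` expects scalar start and stop values, so neither accepts a full column. The sketch below is one way to pick the per-row bounds without branching, using `np.where`. The three-row frame and the names `max_is_zero`, `start` and `stop` are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the three rows shown in the question.
df = pd.DataFrame({"min": [1, 54, 24], "max": [10, 105, 0]})

lo = df["min"].to_numpy()
hi = df["max"].to_numpy()

# Elementwise condition instead of a Python `if` on the whole column.
max_is_zero = hi == 0

# Where max == 0 the range runs 1..min, otherwise min..max.
start = np.where(max_is_zero, 1, lo)
stop = np.where(max_is_zero, lo, hi)
```

This only computes the bounds of each range; building the ranges themselves is a separate step.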

3 Answers

1

You can do:

def create_list(row):
    if row['max'] != 0:
        num_list= np.arange(row['min'], row['max'] + 1, 1)
    else:
        num_list= np.arange(1, row['min'] + 1, 1)

    return num_list


df["num_list"] = df.apply(create_list, axis=1)
Quang Hoang
1

Solution

# The expression sits inside parentheses, so no backslash continuations are needed.
df['All Numbers in Range'] = df.apply(
    lambda row: range(row['Min'], row['Max'] + 1)
    if row['Max'] != 0
    else range(1, row['Min'] + 1),
    axis=1,
)

CypherX
0

You might consider first standardising your data, then performing a simpler operation on all rows.

# use boolean indexing to identify and fix your special cases (max == 0)
maxIsZero = df['max'] == 0
df.loc[maxIsZero, 'max'] = 1
df.loc[maxIsZero, ['min', 'max']] = df.loc[maxIsZero, ['max', 'min']].to_numpy()


# now your min & max columns are semantically consistent:
#    min  max
# 0    1   10
# 1   54  105
# 2    1   24

# create a new range column using a helper function
def generate_range(r):
    return np.arange(r['min'], r['max']+1, 1)

df['num_list'] = df.apply(generate_range, axis=1)
Jon R
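None of the answers above removes the per-row Python call, so here is a minimal sketch of one way to do that with NumPy alone: build every range in a single flat array with `np.repeat` and `np.cumsum`, then cut it into per-row pieces with `np.split`. The function name `ranges_as_arrays` is hypothetical, and the sketch assumes integer bounds that are already standardised so that `stop >= start` on every row. It has not been benchmarked against the data size described in the question.

```python
import numpy as np


def ranges_as_arrays(start, stop):
    """Return a list with one array per row, holding start..stop inclusive.

    Assumes integer inputs with stop >= start for every row.
    """
    start = np.asarray(start, dtype=np.int64)
    stop = np.asarray(stop, dtype=np.int64)
    if start.size == 0:
        return []

    lengths = stop - start + 1      # number of values in each row's range
    ends = np.cumsum(lengths)       # flat position just past each row's range
    firsts = ends - lengths         # flat position where each row's range begins

    # Position within its own range, plus that row's start value.
    flat = np.arange(ends[-1]) - np.repeat(firsts, lengths) + np.repeat(start, lengths)

    # Cut the flat array back into one piece per row.
    return np.split(flat, ends[:-1])


# Bounds for the three sample rows after the max == 0 case is standardised.
pieces = ranges_as_arrays([1, 54, 1], [10, 105, 24])
```

The resulting list could then be stored as a column, for example by wrapping it in `pd.Series(pieces, index=df.index)`. With tens of millions of rows, holding one array object per row still carries a memory and time cost, so keeping the flat array together with the row lengths may be worth considering if the downstream code allows it.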