
I have a DataFrame in a variable called "myDataFrame" that looks like this:

+------+-------+--------+
| Type | Count | Status |
+------+-------+--------+
| a    |    70 |      0 |
| a    |    70 |      0 |
| b    |    70 |      0 |
| c    |    74 |      3 |
| c    |    74 |      2 |
| c    |    74 |      0 |
+------+-------+--------+

I am using a vectorized approach to process the rows in this DataFrame, since it has about 116 million rows.

So I wrote something like this:

myDataFrame['result'] = processDataFrame(myDataFrame['Status'], myDataFrame['Count'])

In my function, I am trying to do this:

def processDataFrame(status, count):
    resultsList = list()
    if status == 0:
       resultsList.append(count + 10000)
    else:
       resultsList.append(count - 10000)

    return resultsList

But I get this error when the status values are compared:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What am I missing?

Kedar Joshi

2 Answers


We can do this without a self-defined function:

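import numpy as np

# Element-wise choice: Count + 10000 where Status is 0, Count - 10000 otherwise
myDataFrame['result'] = np.where(myDataFrame['Status'] == 0,
                                 myDataFrame['Count'] + 10000,
                                 myDataFrame['Count'] - 10000)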

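Update

If you want to keep your function as written, you can apply it row by row (this is not vectorized):

myDataFrame.apply(lambda x: processDataFrame(x['Status'], x['Count']), axis=1)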
0    [10070]
1    [10070]
2    [10070]
3    [-9926]
4    [-9926]
5    [10074]
dtype: object
BENY
  • Thanks. While this is a good idea, I am not really worried about the result column. I do lots of other things in my function. Is there any other way to do it inside the function I mentioned? – Kedar Joshi Jun 05 '20 at 23:33
  • Basically, that degrades the performance. I have read lots of posts saying that a vectorized approach is the best solution for the amount of data I have (116 million rows). I would like to do it with the "vectorization" technique only, please. – Kedar Joshi Jun 05 '20 at 23:39
  • @Deadman A self-defined function cannot be vectorized unless you can write it using only original pandas functions – BENY Jun 05 '20 at 23:40
  • Sure, what do you mean by "original pandas function"? Can you give me an example, please? – Kedar Joshi Jun 05 '20 at 23:41
  • @Deadman Like pandas where, mask, and others. By "original" I mean built-in – BENY Jun 05 '20 at 23:42
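
For reference, here is a minimal sketch of the `where` and `mask` methods mentioned in these comments. It assumes the sample rows and the plus/minus 10000 rule from the question; the variable names (`df`, `is_zero`, `result_where`, `result_mask`) are illustrative only.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "Type": ["a", "a", "b", "c", "c", "c"],
    "Count": [70, 70, 70, 74, 74, 74],
    "Status": [0, 0, 0, 3, 2, 0],
})

is_zero = df["Status"] == 0  # boolean Series, one entry per row

# Series.where keeps values where the condition is True and uses `other` elsewhere.
result_where = (df["Count"] + 10000).where(is_zero, df["Count"] - 10000)

# Series.mask is the inverse: it replaces values where the condition is True.
result_mask = (df["Count"] - 10000).mask(is_zero, df["Count"] + 10000)

print(result_where.tolist())  # [10070, 10070, 10070, -9926, -9926, 10074]
```

Both calls operate on whole columns at once, so no Python-level loop over the rows is involved.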

I think your function is not actually vectorized.

When it is called, you pass the entire Status column as status, so the first if evaluates status == 0 on the whole column. That comparison returns a boolean Series (one True/False per row, indicating whether that element equals 0), so it has no single truth value, hence the error. Similarly, even if the condition could be evaluated, resultsList would just get the whole "Count" column appended once, either all plus 10000 or all minus 10000.
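
To see the failure in isolation, here is a minimal sketch. The Series values are just the Status column from the question's sample table, and the names `mask` and `raised` are illustrative.

```python
import pandas as pd

# Status values copied from the sample table in the question.
status = pd.Series([0, 0, 0, 3, 2, 0])

# Comparing a Series with a scalar is element-wise and yields a boolean Series.
mask = status == 0
print(mask.tolist())  # [True, True, True, False, False, True]

# `if mask:` calls bool(mask), which pandas refuses for a multi-element Series.
try:
    bool(mask)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```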


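Edit:

I suppose this version is closer to what you want: it uses only built-in pandas operations (boolean indexing), but applies them inside your function:

def processDataFrame(status, count):
    status_0 = (status == 0)   # boolean mask: True where status is 0
    output = count.copy()      # copy if you don't want to modify in place
    output[status_0] += 10000  # rows where status is 0
    output[~status_0] -= 10000 # all other rows
    return output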
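
As a quick check of the boolean-indexing approach, the sketch below runs it on the sample rows from the question and compares the result with `np.where`. It assumes the plus/minus 10000 offsets from the original function; `process_vectorized` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

def process_vectorized(status, count):
    """Hypothetical helper: add 10000 where status is 0, subtract 10000 elsewhere."""
    is_zero = status == 0
    output = count.copy()       # avoid mutating the caller's column
    output[is_zero] += 10000
    output[~is_zero] -= 10000
    return output

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "Type": ["a", "a", "b", "c", "c", "c"],
    "Count": [70, 70, 70, 74, 74, 74],
    "Status": [0, 0, 0, 3, 2, 0],
})

df["result"] = process_vectorized(df["Status"], df["Count"])

# Cross-check against the np.where formulation of the same rule.
expected = np.where(df["Status"] == 0, df["Count"] + 10000, df["Count"] - 10000)
print(df["result"].tolist())  # [10070, 10070, 10070, -9926, -9926, 10074]
```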
Tom
  • Ok, what would be the vectorized solution? – Kedar Joshi Jun 05 '20 at 23:37
  • edited to add an option; I guess per your comments with @YOBEN_S, you could change the computations to be other processing, but you would have to rely on `pandas`/`numpy` functions, and not loops/iteration – Tom Jun 06 '20 at 00:02
  • I am not too familiar with this stuff tbh, but [this answer](https://stackoverflow.com/a/52674448/13386979) seems informative – Tom Jun 06 '20 at 00:03
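
If the real processing has more than two branches, one option that stays within `numpy` functions is `np.select`. This is only a sketch: the three rules below (in particular the doubling for a status of 2) are invented for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Count and Status columns copied from the sample table in the question.
df = pd.DataFrame({
    "Count": [70, 70, 70, 74, 74, 74],
    "Status": [0, 0, 0, 3, 2, 0],
})

# Conditions are checked in order; the first one that is True picks the choice.
conditions = [df["Status"] == 0, df["Status"] == 2]
choices = [df["Count"] + 10000, df["Count"] * 2]  # hypothetical rules

# Rows matching no condition fall back to `default`.
df["result"] = np.select(conditions, choices, default=df["Count"] - 10000)

print(df["result"].tolist())  # [10070, 10070, 10070, -9926, 148, 10074]
```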