Pandas, for each row getting value of largest column between two columns

Question

I'd like to express the following on a pandas data frame, but I don't know how to do it other than by slow manual iteration over all cells.

For context: I have a data frame with two categories of columns, which we'll call the read_columns and the non_read_columns. Given a column name, I have a function that returns true or false to tell you which category the column belongs to.

Given a specific read column A:
    For each row:
        1. Inspect the read column A to get the value X
        2. Find the read column with the smallest value Y that is greater than X.
            If no read column has a value greater than X, then substitute the largest value
            found in all of the *non*-read columns, call it Z, and skip to step 4.
        3. Find the non-read column with the greatest value between X and Y and call its value Z.
        4. Compute Z - X

At the end I hope to have a series of the Z - X values with the same index as the original data frame. Note that the sort order of column values is not consistent across rows.

What's the best way to do this?

why the downvotes? seems like a normal programming question rigorously stated...? — Joseph Garvin, Nov 01 '17 at 19:19
Can you create sample inputs and expected outputs? [How to ask a good Pandas question?](https://stackoverflow.com/a/20159305/6361531) — Scott Boston, Nov 01 '17 at 19:25
I think what I've stated completely unambiguously describes the problem. The lack of example data may be a good reason to not spend time answering the question if you're someone who otherwise might be interested but I don't see why it merits downvotes. Now the question will just be buried and even if I add data won't get the visibility to actually be answered. — Joseph Garvin, Nov 01 '17 at 19:35
"What's the best way to do this" is extraordinarily broad and gives anyone answering zero direction. Providing an example and an attempt can give us a starting point to benchmark against. — Andrew L, Nov 01 '17 at 19:39
@AndrewL expressing it in a way that is natural in pandas and thus faster than manual iteration. I think this is clear and you're being pedantic for its own sake. "best way" is an idiom. — Joseph Garvin, Nov 01 '17 at 20:23
Well, I made an attempt to answer it, but I am starting to think, that my answer is invisible. @JosephGarvin, any comments on it? — wombatonfire, Nov 01 '17 at 21:53

wombatonfire · Answer 1 · 2017-11-01T19:54:29.423

It's hard to give an answer without seeing an example DataFrame, but you could do the following:

Separate your read columns with Y values into a new DF.
Transpose this new DF to get the Y values in columns, not in rows.
Use built-in vectorized functions on the Series of Y values instead of iterating the rows and columns manually. You could first filter the values greater than X, and then apply min() on the filtered Series.
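
As a rough sketch, the steps above can be extended to the whole procedure in the question. This version swaps the transpose for row-wise masks with `axis=0` alignment and `axis=1` reductions, which should give the same filtering without reshaping. The `read_` prefix test in `is_read_column`, the column names, and the sample data are hypothetical stand-ins for the asker's real classifier and frame. It also assumes "greater than" and "between" are strict, and that a row with no non-read value between X and Y should come out as NaN.

```python
import numpy as np
import pandas as pd


def is_read_column(name):
    # Hypothetical stand-in for the asker's own classifier function.
    return name.startswith("read_")


def gap_to_next(df, a):
    """Return Z - X per row, following the steps in the question."""
    read_cols = [c for c in df.columns if is_read_column(c)]
    other_cols = [c for c in df.columns if not is_read_column(c)]
    reads = df[read_cols]
    others = df[other_cols]
    x = df[a]

    # Step 2: smallest read value strictly greater than X (NaN if none).
    y = reads.where(reads.gt(x, axis=0)).min(axis=1)

    # Step 3: largest non-read value strictly between X and Y.
    # Comparisons against a NaN Y are False, so those rows yield NaN here.
    between = others.gt(x, axis=0) & others.lt(y, axis=0)
    z = others.where(between).max(axis=1)

    # Fallback from step 2: where no read value exceeds X,
    # use the largest non-read value in the row instead.
    z = z.where(y.notna(), others.max(axis=1))

    # Step 4.
    return z - x


# Made-up sample data, only to exercise the three cases.
df = pd.DataFrame(
    {
        "read_a": [1.0, 5.0, 2.0],
        "read_b": [4.0, 3.0, 9.0],
        "other_c": [2.0, 8.0, 10.0],
        "other_d": [3.0, 6.0, 1.0],
    },
    index=["r1", "r2", "r3"],
)

result = gap_to_next(df, "read_a")
# r1: X=1, Y=4, Z=3        -> 2.0
# r2: X=5, no Y, Z=max=8   -> 3.0
# r3: X=2, Y=9, no Z found -> NaN
print(result)
```

The result keeps the index of the original frame, so it can be assigned straight back as a new column if that is useful.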
