
I can't figure out how to add the string result ('color_hek') back to the dataframe (df) as a separate column.

#importing libraries
import pandas as pd 

#creating dataset
data = [[0,0,0], [0,0,0]] 
df = pd.DataFrame(data, columns = ['red', 'green', 'blue']) 

#defining function
def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)

#looping through the dataframe to apply the function
for index, row in df.iterrows():
    color_hek = rgb_to_hex(row['red'].astype(int),row['green'].astype(int),row['blue'].astype(int))
    print(color_hek)
eponkratova
  • Related: [pandas create new column based on values from other columns / apply a function of multiple columns, row-wise](https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns-apply-a-function-o/26887820#26887820) – smci Sep 28 '19 at 04:44
  • eponkratova: Neat question. I suggest you retitle it, since your main issue wasn't appending the result column (don't use print), but rather **calling a function of multiple arguments on the dataframe, row-wise**. And doing so efficiently, without `df.iterrows()` – smci Sep 28 '19 at 05:21

2 Answers


You want to apply rgb_to_hex() to each row's 'red', 'green', 'blue' columns. This is a one-liner with apply(). Avoid .iterrows(): it is slow, not vectorized, and almost always avoidable.

# First, convert the 'red', 'green', 'blue' columns to int
df[['red', 'green', 'blue']] = df[['red', 'green', 'blue']].astype(int)

def rgb_to_hex(row):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (row['red'], row['green'], row['blue'])

df['hek'] = df.apply(rgb_to_hex, axis=1)

You can make the code even more compact in this particular case, as @cs95 showed. Since you know your dataframe only has the columns 'red', 'green', 'blue', in that order, you can pass the whole row to the format string as a tuple:

def rgb_to_hex(row):
    # A bare `% *row` is a syntax error; convert the row to a tuple instead
    return '#%02x%02x%02x' % tuple(row)
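
When the dataframe has other columns as well, one option is to keep the original three-argument rgb_to_hex() from the question and pick the columns by name inside a lambda. This is only a minimal sketch: the 'name' column and the sample values are made up for illustration.

```python
import pandas as pd

def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)

# Hypothetical frame with an extra 'name' column next to the color channels
df = pd.DataFrame({
    'name': ['black', 'orange'],
    'red': [0, 255],
    'green': [0, 165],
    'blue': [0, 0],
})

# Select the channels by name so column order and extra columns do not matter
df['hex'] = df.apply(
    lambda row: rgb_to_hex(row['red'], row['green'], row['blue']),
    axis=1,
)
print(df)
```

Referencing the columns by name costs a few more characters than passing the whole row, but it keeps working if columns are added or reordered.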
smci

This is a simple problem of assignment, but you should not use iterrows, and especially not when you want to mutate your DataFrame.

Use a list comprehension instead and assign this back as a new column.

df['hex'] = [rgb_to_hex(*v) for v in df.values]
# Or, if you have more than three columns,
# df['hex'] = [rgb_to_hex(*v) for v in df[['red', 'green', 'blue']].values]

   red  green  blue      hex
0    0      0     0  #000000
1    0      0     0  #000000

In my experience, list comprehensions are fast, and they are the next-best option when a problem cannot be vectorized.
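
To check the speed claim on your own machine, a rough timing sketch along these lines may help. The 10,000-row test frame, the helper names and the repeat count are arbitrary choices for illustration, and the numbers will vary with hardware and pandas version.

```python
import timeit

import pandas as pd

def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)

# Hypothetical test data: 10,000 rows cycling through channel values 0..255
n = 10_000
df = pd.DataFrame({
    'red': [i % 256 for i in range(n)],
    'green': [(i * 3) % 256 for i in range(n)],
    'blue': [(i * 7) % 256 for i in range(n)],
})

def with_list_comprehension():
    # Iterate over the rows of the underlying NumPy array
    return [rgb_to_hex(*v) for v in df[['red', 'green', 'blue']].values]

def with_apply():
    # Row-wise apply, selecting the channels by name
    return df.apply(
        lambda row: rgb_to_hex(row['red'], row['green'], row['blue']),
        axis=1,
    ).tolist()

# Time three runs of each approach
t_listcomp = timeit.timeit(with_list_comprehension, number=3)
t_apply = timeit.timeit(with_apply, number=3)
print(f'list comprehension: {t_listcomp:.3f}s, apply: {t_apply:.3f}s')
```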

cs95
  • This is the general shape of how you want to go about these things, except that when you use numpy/pandas, you want to use their tools for "broadcasting" rather than list comprehensions wherever possible. – Karl Knechtel Sep 28 '19 at 04:39
  • @KarlKnechtel Sure, but [`apply` is not a vectorised function](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code/54432584#54432584) and this isn't really a vectorizable problem (but I agree that vectorization > iteration and mentioned as such in my last sentence). – cs95 Sep 28 '19 at 04:40
  • @cs95: that trick only works in this case since it so happens that the dataframe only contains the three columns we need, in that order, and no other columns, so you can just directly use tuple-unpacking. But in the general case, when we want an arbitrary function of multiple inputs, operating on arbitrary columns of each row by name, such as [this](https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns-apply-a-function-o), the function needs to reference columns by name: `row['red'], row['green'], row['blue']` – smci Sep 28 '19 at 04:58
  • Second point: your list comprehension allocates a temporary which could get very large on, say, a 4K image or a GB-sized dataframe. – smci Sep 28 '19 at 05:00
  • @smci re:first comment, just take the columns you need first: `[... for v in df[['red', 'green', 'blue']].values]`. Re:second, if you have that much data you likely don't want to be using pandas in the first place (map reduce?). – cs95 Sep 28 '19 at 05:10
  • @cs95: `df[['red', 'green', 'blue']].values` is creating another big temporary variable from a slice, and it's not necessary. You kind of have a cavalier attitude to creating unnecessary temporaries, which could be many MB or GB on a large production-size dataframe; this can actually blow out physical memory. If we want our answers to be generic and reusable, we should at least note that limitation. – smci Sep 28 '19 at 05:12
  • Yes, because anything more than a couple million rows of data is really not what pandas was designed for, and while memory efficiency is important, performance is usually the first bottleneck before memory. – cs95 Sep 28 '19 at 05:39
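
The trade-offs raised in these comments can be sketched with two variants that do not appear in the answers above, so treat them as illustrations rather than anyone's recommended method. The first iterates the three columns in lockstep with zip(), which avoids building the intermediate 2-D array from `.values`. The second uses vectorized arithmetic to pack the channels into a single integer and only formats element-wise; it assumes every channel is an integer between 0 and 255. The column names 'hex_zip' and 'hex_packed' and the sample values are made up.

```python
import pandas as pd

def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)

df = pd.DataFrame({'red': [0, 255], 'green': [0, 165], 'blue': [0, 0]})

# Variant 1: walk the three columns together; no 2-D array is created
df['hex_zip'] = [
    rgb_to_hex(r, g, b)
    for r, g, b in zip(df['red'], df['green'], df['blue'])
]

# Variant 2: combine the channels with vectorized arithmetic,
# then format each packed integer as six hex digits
# (assumes every channel value is an integer in 0..255)
packed = df['red'] * 65536 + df['green'] * 256 + df['blue']
df['hex_packed'] = packed.map('#{:06x}'.format)

print(df)
```

Both variants still build the full result column in memory, which is unavoidable when the goal is a new column.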