I promise I looked for the answer first. I only started learning R yesterday and the question is very basic. So either it's too basic to have been asked or I don't realize I'm reading the answer.

Basically, this is the task in this exercise (don't worry, it's not a test):

You can also use the mutate() function to make changes to your columns. Let's say you wanted to create a new column that summed up all the adults, children, and babies on a reservation for the total number of people. Modify the code chunk below to create that new column:

example_df <- bookings_df %>%
  mutate(guests = )

head(example_df)

There are three columns ("adults", "children", and "babies") whose values I want to add in the new column called "guests". So I tried:

example_df <- bookings_df %>%
  mutate(example_df, guests = c("adults", "children", "babies"))

head(example_df)

And of course, it comes back with an error "guests must be size 119390 or 1, not 3."

Now, up until this point, I haven't learned any advanced functions. So the answer is going to be something extremely basic for a first week R student.

Help?

zephryl
G. R.
  • mutate can also be used to create a new column. `example_df <- bookings_df %>% mutate(guests = sum(adults, children, babies))` might work. For future reference, it is best practice to provide some sample data so those who help you don't have to do as much work. For that, look into `dput()` – TTS Dec 28 '22 at 00:20
  • 1
    @TTS you should add `rowwise()` before summation. Otherwise, you will get a single value indicating the total of all columns. – Darren Tsai Dec 28 '22 at 04:07

1 Answer

Assuming the data in the "adults", "children", and "babies" columns are the numeric counts of how many there are for each reservation, here are two simple solutions:

# solution 1: use the addition operator
example_df <- bookings_df %>% mutate(guests = adults + children + babies)

# solution 2: use the rowwise() and sum() functions
example_df <- bookings_df %>% rowwise() %>% mutate(guests = sum(adults, children, babies))

Note that rowwise() in Solution 2 makes sum() operate one row at a time. Without rowwise(), sum() adds every value in all three columns together and fills every row of the guests column with that single grand total.
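To see the difference concretely, here is a minimal sketch on a made-up three-row data frame. The name toy_df and its values are hypothetical stand-ins for bookings_df, not the real data:

```r
library(dplyr)

# Hypothetical toy data standing in for bookings_df
toy_df <- tibble(
  adults   = c(2, 1, 2),
  children = c(0, 2, 1),
  babies   = c(0, 0, 1)
)

# Vectorized addition: one total per row
plus_df <- toy_df %>% mutate(guests = adults + children + babies)

# sum() without rowwise(): one grand total, repeated in every row
grand_df <- toy_df %>% mutate(guests = sum(adults, children, babies))

# rowwise() + sum(): one total per row again;
# ungroup() drops the row-wise grouping afterwards
rowwise_df <- toy_df %>%
  rowwise() %>%
  mutate(guests = sum(adults, children, babies)) %>%
  ungroup()

plus_df$guests     # 2 3 4
grand_df$guests    # 9 9 9
rowwise_df$guests  # 2 3 4
```

On a large data frame the plain + version is usually the simpler choice, since it is already vectorized and needs no rowwise() step.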


Explaining the error in your attempt

c("adults", "children", "babies") in your attempt creates a vector of three words (a character vector of length 3). That is why you received the error "guests must be size 119390 or 1, not 3." A new column must have either 119390 values, one for each row of bookings_df, or a single value, which is then repeated in every row.

c() also does not perform mathematical addition of those three words. Run ?c in R for a helpful description of how to use c().

The quotation marks around "adults", "children", and "babies" also make those items plain words (character strings). That means the three words do not refer to the columns in your data frame. (There is a longer, roundabout way to make character strings refer to column names using other functions, but that is likely beyond the scope of your exercise.)
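The points above can be checked with a small sketch. It assumes a hypothetical four-row toy_df (four rows so that a length-3 vector fits neither the row count nor the recyclable length 1); the names toy_df, col_names, and attempt are illustrative only:

```r
library(dplyr)

# Hypothetical toy data with 4 rows
toy_df <- tibble(
  adults   = c(2, 1, 2, 3),
  children = c(0, 2, 1, 0),
  babies   = c(0, 0, 1, 0)
)

col_names <- c("adults", "children", "babies")
class(col_names)   # "character"
length(col_names)  # 3

# Length 3 matches neither 4 rows nor length 1, so mutate() errors;
# tryCatch() turns the error into a marker string so the script keeps running
attempt <- tryCatch(
  toy_df %>% mutate(guests = col_names),
  error = function(e) "error"
)

# A length-1 value is recycled to every row
recycled_df <- toy_df %>% mutate(guests = "adults")

# Unquoted names refer to the columns themselves
fixed_df <- toy_df %>% mutate(guests = adults + children + babies)
fixed_df$guests  # 2 3 4 3
```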

LC-datascientist
  • 2
    The second code (`guests = sum(...)`) is going to calculate a single value and repeat it for every row in the frame. While it's a valid "statistic" in the pure sense, I wonder if the OP and/or intent of this answer is to more display the `+`-based answer, providing row-wise sums. – r2evans Dec 28 '22 at 01:04
  • The first one is exactly what I was looking for, thank you! As an aside, solution #2 returned all NAs. Not sure if the reason relates to @r2evans 's comment, but for my education, could one of you explain to me why it would do that? It seems to me sum() should work there too, no? – G. R. Dec 28 '22 at 01:41
  • 1
    (1) If _any_ value in any of those columns is `NA`, then you cannot get a sum without also specifying `na.rm=TRUE` to ignore them. Think of it this way: an `NA` value really means *"it could be anything"*. When all numbers are non-`NA`, then `sum(.)` on those numbers is well defined. However, if anything is "it could be anything", then the sum also "could be anything" (inf, negative inf, 0, 42, who knows). (2) `sum(.)` in a `mutate` call is almost never what is needed; the exception is when `rowwise()` or `group_by(.)` is used preceding the mutate call. – r2evans Dec 28 '22 at 01:45
  • Thanks for the comments. I was negligent and didn't check the output when I posted. I provided a correction to my Solution 2. As @r2evans described, if you have any missing values or `NA`, then the `sum()` would return `NA`. You may specify the parameter `na.rm=TRUE` in `sum()` (i.e., `sum(..., na.rm=TRUE)`) to ignore missing values. – LC-datascientist Dec 28 '22 at 02:19
  • But a word of caution when using `sum(..., na.rm=TRUE)`: the amount of data that you sum together may not be the same for every row. E.g., If Row 1 reports 20 guests in a child's birthday party and Row 2 reports 1 guest in another child's party, it may be because the number of children in the second party was not reported (`NA` value). We cannot confirm from the data whether it was a very small party or a very big party with 100 children because we do not have the data. – LC-datascientist Dec 28 '22 at 02:26
  • Other choices: `bookings_df %>% mutate(guests = rowSums(cbind(adults, children, babies)))` or `bookings_df %>% mutate(guests = rowSums(across(c("adults", "children", "babies"))))`. If there are many columns to be summed, `across()` is flexible because it supports *tidy selections*, and it makes the character words refer to the column names in the data frame. – Darren Tsai Dec 28 '22 at 04:16
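
The NA behaviour and the rowSums() alternative discussed in the comments can be tried on a small sketch. The toy_df below is hypothetical, with one missing value planted in the children column; the column names strict and lenient are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy_df <- tibble(
  adults   = c(2, 1, 2),
  children = c(0, NA, 1),
  babies   = c(0, 0, 1)
)

sum(c(1, NA, 0))                # NA
sum(c(1, NA, 0), na.rm = TRUE)  # 1

na_df <- toy_df %>%
  mutate(
    # Any NA in a row makes that row's total NA
    strict  = adults + children + babies,
    # rowSums() with na.rm = TRUE treats the missing value as absent
    lenient = rowSums(across(c(adults, children, babies)), na.rm = TRUE)
  )

na_df$strict   # values: 2 NA 4
na_df$lenient  # values: 2 1 4
```

Which version is appropriate depends on whether a missing count should make the whole total unknown (strict) or be skipped (lenient), as the word of caution above points out.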