-1

I have the dataset below (with more columns) and would like to create a new column and add the number of trips per day there. I guess I have to count the column "Date", I've tried few codes, but no success.

Trip Duration  Date         Start Station ID
680            01/01/2017        495
1282           01/01/2017        498
648            01/01/2017        333
632            02/01/2017        348
583            02/01/2017        958
589            03/01/2017        019
584            03/01/2017        496
867            03/01/2017        400
594            01/02/2017        419
587            01/02/2017        409
948            02/02/2017        395
476            02/02/2017        598

Many thanks

markus
  • 25,843
  • 5
  • 39
  • 58
Paola
  • 1
  • Welcome to SO! Your desired output is a data.frame with only two columns with dates and count? What about Trip Duration and start station? Are you dropping them or somewhat aggregate them? – s__ Nov 30 '18 at 21:34
  • Thanks for your answer, I'm new here and also new doing that. I have more columns: Trip duration, Date, Start time, Stop time, Start Station ID, Start Station name, etc. I thought I shouldn't be using them to count the number of trips. I will also merge my weather and bike trips dataset. I was thinking in before that to create a column with the number of trips to use as a dependent variable in the future vs the independent variables such as temperature, wind, precipitation etc (multilinear regression). Would I need another column to count number of trips the day? – Paola Nov 30 '18 at 21:43
  • I guess if I merge later the weather dataset with bikes dataset considering the column "Date" the one that they have in common I won't have a redundant number of trips as the weather has just 365 date (one row per day of the year). Is that right? I'm trying to use Bruno's code, but it's giving me an error: Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "function" > – Paola Nov 30 '18 at 22:06
  • Hi @s_t, thank you for being very patient with me. The idea is to have a column with the number of trips per day one and them merge this column with my wether dataset. So I will be able to work with the number of trips as a dependent variable and some others weather columns as independent variables for a linear regression. – Paola Nov 30 '18 at 22:25
  • I've tried your code anyway after changing the names of the variable Date and got: Error in `colnames<-`(`*tmp*`, value = c(NA, "date_trip")) : attempt to set 'colnames' on an object with less than two dimensions – Paola Nov 30 '18 at 22:25
  • > colnames(df)[2] <-'date_trip' Error in `colnames<-`(`*tmp*`, value = c(NA, "date_trip")) : attempt to set 'colnames' on an object with less than two dimensions – Paola Nov 30 '18 at 22:45
  • I think you have some problem with the original names of your columns. You can try to dput (head (df,10)) and update your answer with the result of it ( kind of structure...): it's going to give the exact first 10 rows of your data to work with and you column names. As you can see our codes works wit names without spaces. – s__ Nov 30 '18 at 22:57
  • You were right @s_t. Bruno's code is running now, I just don't know if when I' using view(df) I'd be able to see this new column "Trip" there because I can't. Even though I can see this as a last column on the console area. I'm also trying to merge the datasets now, but I don't think the new column "Trips" is there as well. I didn't use your code with average because I really need a column with a number of trips per day to run the model.... still stuck here :-( – Paola Nov 30 '18 at 23:34
  • please read under my answer the comment. – s__ Dec 03 '18 at 08:09

2 Answers2

1

Let's first make your data reproducible. Always concerns about this. It's very important offer a reproducible example for who could spend some time to help.

df <- read.table(
  text = "
  Trip_Duration  Date         Start_Station_ID
  680            01/01/2017        495
  1282           01/01/2017        498
  648            01/01/2017        333
  632            02/01/2017        348
  583            02/01/2017        958
  589            03/01/2017        019
  584            03/01/2017        496
  867            03/01/2017        400
  594            01/02/2017        419
  587            01/02/2017        409
  948            02/02/2017        395
  476            02/02/2017        598
  ", header = TRUE)

Using dplyr you can do what you want like this:

df %>% 
  group_by(Date) %>% 
  mutate(Trips = n()) %>% 
  ungroup()

# A tibble: 12 x 4
   Trip_Duration Date       Start_Station_ID Trips
           <int> <fct>                 <int> <int>
 1           680 01/01/2017              495     3
 2          1282 01/01/2017              498     3
 3           648 01/01/2017              333     3
 4           632 02/01/2017              348     2
 5           583 02/01/2017              958     2
 6           589 03/01/2017               19     3
 7           584 03/01/2017              496     3
 8           867 03/01/2017              400     3
 9           594 01/02/2017              419     2
10           587 01/02/2017              409     2
11           948 02/02/2017              395     2
12           476 02/02/2017              598     2

First we group_by(Date), which is the target variable for trips frequency. After we use the n() statement inside mutate() to count each day, adding results to a new variable called Trips. Finally we ungroup() the data frame.

Bruno Pinheiro
  • 964
  • 8
  • 20
  • Hi Bruno, thanks! I will do that next time. I've tried what you suggested below and got this error: Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "function" > – Paola Nov 30 '18 at 21:55
  • Hi, @Paola. Edit your answer to add the code. It'll be easier to help. – Bruno Pinheiro Nov 30 '18 at 22:05
  • Hi Bruno, sorry I'm new here. I didn't get what you've asked. Could you tell me what do I have to do, please. Obrigada! – Paola Nov 30 '18 at 22:27
  • @Paola, when asking here on StackOverflow, we look to follow some rules, basically to promote well structured interactions to find solutions and build knowledge. The most important standard is make everything reproducible. Take a look at the link in my answer to understand this better. But you can edit your question to improve, add new insights, ow what you want. If you include a reproducible piece of your data, and the code you are using, you make more clear for us what is happening. If not, less people will help and, like said s_t, we'll keep only messing up without finding a solution. – Bruno Pinheiro Dec 01 '18 at 09:59
  • thanks for the advice. I'll update my question and provide more reproducible information today. – Paola Dec 01 '18 at 11:32
1

You can aggregate them in this way, then merge it with your data.

Using the Bruno Pinheiro data, but renaming the column of the dates:

library(dpyr)
colnames(df)[2] <-'date_trip'                                            # renaming the 2 column                                  
dats <- df %>% group_by(date_trip) %>% summarise(num_of_trips = n()      # count the trip
                                    )
dats
# A tibble: 5 x 3
  date_trip  num_of_trips 
  <fct>             <int>      
1 01/01/2017            3       
2 01/02/2017            2       
3 02/01/2017            2       
4 02/02/2017            2         
5 03/01/2017            3  
s__
  • 9,270
  • 3
  • 27
  • 45
  • @Paola , I cannot help further without a part of your data. Please add the dput(head (df,10)) output editing your question, otherwise we are only messing up without finding a solution. – s__ Dec 01 '18 at 08:32