0

I'm a beginner learner in R and my data is the data of a hotel with 3 different apartments and its bookings between 2018 and 2022. For each booking I have an arrival date and a departure date (already in the date format), the total nights booked and which apartment they stayed in. I now want to analyze and plot the data to see how the number of bookings have changed over time overall and for each of the different apartments. I would really like to get a graph that looks like a geom_line graph in ggplot, but I can't figure out the best way to do this, and how to put the "number of bookings" into a variable.

Thank you!

This is the head of my data:

  arr_date   dep_date total_nights hochsaison mittelsaison nebensaison ost west sued ost.west sued.west sued.ost gesamtes_haus year month wday
1 2018-01-12 2018-01-14            2          0            0           2   0    0    0        0         0        2             0 2018   Jan   Fr
2 2018-01-17 2018-01-21            4          0            0           4   0    4    0        0         0        0             0 2018   Jan   Mi
3 2018-01-21 2018-01-24            3          0            0           3   0    0    3        0         0        0             0 2018   Jan   So
4 2018-02-09 2018-02-11            2          0            2           0   0    0    0        2         0        0             0 2018   Feb   Fr
5 2018-02-09 2018-02-13            4          0            4           0   0    0    4        0         0        0             0 2018   Feb   Fr
6 2018-02-16 2018-02-18            2          0            0           2   0    0    0        0         0        0             2 2018   Feb   Fr


Rows: 323
Columns: 16
$ arr_date      <date> 2018-01-12, 2018-01-17, 2018-01-21, 2018-02-09, 2018-02-09, 2018-02-16, 2018-02-23, 2018-02-22, 2018-03-19, 2018-03-27, 2018-03-29, 2018-04-19, 2018-04-27, 201…
$ dep_date      <date> 2018-01-14, 2018-01-21, 2018-01-24, 2018-02-11, 2018-02-13, 2018-02-18, 2018-02-25, 2018-02-25, 2018-03-24, 2018-04-03, 2018-04-04, 2018-04-24, 2018-05-01, 201…
$ total_nights  <dbl> 2, 4, 3, 2, 4, 2, 2, 3, 5, 7, 5, 5, 4, 3, 6, 5, 17, 7, 7, 15, 13, 9, 19, 7, 7, 14, 8, 13, 10, 11, 6, 7, 14, 6, 5, 5, 13, 7, 7, 10, 3, 6, 5, 7, 12, 12, 12, 2, 2,…
$ hochsaison    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 5, 0, 3, 3, 6, 5, 17, 7, 7, 15, 13, 9, 19, 7, 7, 14, 8, 13, 10, 11, 6, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mittelsaison  <dbl> 0, 0, 0, 2, 4, 0, 0, 0, 0, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 14, 6, 5, 5, 13, 7, 7, 10, 3, 6, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ nebensaison   <dbl> 2, 4, 3, 0, 0, 2, 2, 3, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 12, 12, 2, 2, 4, 2, 4, 1…
$ ost           <dbl> 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 15, 0, 0, 19, 0, 0, 0, 0, 13, 0, 0, 0, 0, 0, 0, 5, 5, 0, 7, 0, 0, 0, 6, 0, 7, 12, 0, 0, 0, 0, 0, 0, 0,…
$ west          <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 5, 0, 5, 5, 0, 0, 6, 0, 0, 0, 7, 0, 13, 0, 0, 7, 0, 14, 0, 0, 0, 11, 0, 7, 14, 0, 0, 0, 0, 0, 0, 10, 0, 0, 5, 0, 0, 12, 0, 0, 0, 0, 2, 0…
$ sued          <dbl> 0, 0, 3, 0, 4, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 7, 0, 8, 0, 10, 0, 6, 0, 0, 6, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 12, 0, 0, 4, 0, 0, 0,…
$ ost.west      <dbl> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ sued.west     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ sued.ost      <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0…
$ gesamtes_haus <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 3, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0…
$ year          <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018…
$ month         <chr> "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb", "Feb", "Mär", "Mär", "Mär", "Apr", "Apr", "Mai", "Mai", "Jun", "Jun", "Jun", "Jun", "Jul", "Jul", "Jul", "Jul",…
$ wday          <chr> "Fr", "Mi", "So", "Fr", "Fr", "Fr", "Fr", "Do", "Mo", "Di", "Do", "Do", "Fr", "Fr", "Sa", "Fr", "Do", "Sa", "Sa", "Sa", "So", "Do", "So", "Sa", "Sa", "Sa", "Sa"…
> 


structure(list(arr_date = structure(c(17543, 17548, 17552, 17571, 
17571, 17578), class = "Date"), dep_date = structure(c(17545, 
17552, 17555, 17573, 17575, 17580), class = "Date"), total_nights = c(2, 
4, 3, 2, 4, 2), hochsaison = c(0, 0, 0, 0, 0, 0), mittelsaison = c(0, 
0, 0, 2, 4, 0), nebensaison = c(2, 4, 3, 0, 0, 2), ost = c(0, 
0, 0, 0, 0, 0), west = c(0, 4, 0, 0, 0, 0), sued = c(0, 0, 3, 
0, 4, 0), ost.west = c(0, 0, 0, 2, 0, 0), sued.west = c(0, 0, 
0, 0, 0, 0), sued.ost = c(2, 0, 0, 0, 0, 0), gesamtes_haus = c(0, 
0, 0, 0, 0, 2), year = c(2018, 2018, 2018, 2018, 2018, 2018), 
    month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb"), wday = c("Fr", 
    "Mi", "So", "Fr", "Fr", "Fr"), wohnung = c("kombi", "west", 
    "sued", "kombi", "sued", "gesamtes haus"), saison = c("nebensaison", 
    "nebensaison", "nebensaison", "mittelsaison", "mittelsaison", 
    "nebensaison")), row.names = c(NA, 6L), class = "data.frame")
Sofia
  • 21
  • 3
  • 2
    Could you use `dput` function on your dataFrame and then write the structure output in your question – Yacine Hajji Aug 19 '22 at 13:09
  • 1
    Hi Sofia. Welcome to Stack Overflow. You have given a description of your problem and your data, but that means we can only give you a description of the solution: group the data by month and apartment, then sum the nights booked. Plot the month of the booking on the x axis and the sum of the nights booked on the y axis, with color mapped to apartment. If you need more concrete help, we need to see a sample of your actual data so we know what your data frame and columns are called. Please edit your question to include this data if that is the case. Thanks – Allan Cameron Aug 19 '22 at 13:26
  • Hello! Thank you for commenting! I put the head of my data in my question, does that help? – Sofia Aug 19 '22 at 13:49
  • Hi Sofia - thanks for sharing data this time and for the structure. The structure `glimpse` isn't nearly as nice a summary as `dput()` gives. If you could please run the command `dput(your_data[1:6, ])` that will make a copy/pasteable version of your data that we can run to recreate your data. Showing that along with the `head()` you already show will make it much easier to help you. – Gregor Thomas Aug 19 '22 at 13:50
  • A little more context would also be helpful. Can you describe how to derive "number of bookings" from your data? And you say you want to look at things "for each of the different apartments"--which columns indicate apartments? – Gregor Thomas Aug 19 '22 at 13:54
  • Okay, I have never used the dput function, I hope I used it correctly. – Sofia Aug 19 '22 at 13:57
  • Well one observation from my data represents one booking. Each booking is made for a specific apartment (sometimes a combination of two apartments or for the whole house) and for a specific number of nights. For each apartment and their combinations I created a variable. So one apartment is called "ost", another is called "west", the third is called "sued". The combination is called "sued.west" and so on and the whole house is called "gesamtes_haus". – Sofia Aug 19 '22 at 13:59
  • Your original post says that the data covers 2018 to 2022, but your provided dataframe only covers January and February 2018. Could you provide the entire (or at least more of) your dataframe? – jrcalabrese Aug 20 '22 at 13:46
  • @jrcalabrese oh please no. **minimal** reproducible examples are great. We don't need 100s or 1000s of rows to illustrate problems. – Gregor Thomas Aug 22 '22 at 15:49
  • @Sofia so, I see your input data and I can reproduce it--which is great. Can you help me understand what you would like to get out? Just for these 6 sample rows, what is the desired result? If you want a line graph, do you want it to have the number of bookings at every single date? Or maybe aggregated by week? By month? Something else? – Gregor Thomas Aug 22 '22 at 15:52
  • @GregorThomas thank you for your reply and I'm sorry for my late answer. I'm still trying to get used to the data itself. For example I was wondering - does R know the bookings for every single date even though I only put in the arrival and departure date? I think what I would like to see is a development of bookings over time - so either for every single date, or over the years / months. Right now I can't tell what time phrame would be best, but I would like to see changes over time. Does that make sense? – Sofia Aug 29 '22 at 16:30
  • Also I would like to map the number of bookings (one booking is one person booking the place, regardless of how many days they stayed - this is basically an observation (the rows in my data)) but also it would be cool to map the number of days booked - so this would take into account how long a person has booked the place. – Sofia Aug 29 '22 at 16:33
  • No, R doesn't know that you intend these columns to be start and stop times and that you want to know information about the days in between. I've added an answer to show how to do that, which is a good starting place for you. – Gregor Thomas Aug 29 '22 at 17:30

1 Answers1

0

Here's something to get you started. I'd suggest you play with this and refine your thoughts and ask a new question when you have a clearer idea of your goal.

library(dplyr)
library(tidyr)
library(lubridate)
library(purrr)
library(ggplot2)

apt = c("ost", "west", "sued", "ost.west", "sued.west", "sued.ost")

df %>%
  ## get the days in each booking interval
  mutate(days = map2(arr_date, dep_date, .f = seq.Date, by = "1 day")) %>%
  ## get rid of unneeded columns
  select(all_of(apt), days) %>%
  ## convert the night counts in each apartment to 1 or 0
  mutate(across(all_of(apt), pmin, 1)) %>%
  ## pivot to long format
  pivot_longer(cols = all_of(apt), names_to = "apt") %>%
  ## discard apartments with no booking
  filter(value != 0) %>%
  select(-value) %>%
  ## convert days to its own column
  unnest(days) -> df_long

## count days for plotting
count(df_long, days) %>%
  ggplot(aes(x = days, y = n)) +
  geom_line() +
  labs(
    x = "date",
    y = "separate bookings"
  ) +
  expand_limits(y = 0)

enter image description here

To understand what the code does, run it line by line to see what is changing each time (that is first df %>% mutate(...), then df %>% mutate(...) %>% select(...), then df %>% mutate(...) %>% select(...) %>% mutate(...)).

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Hi @Gregor Thomas, thank you so much for your help. I do have some questions:) First I wanted to know - is there a difference between using = instead of <- ? Or is there a specific reason why you used =? Secondly - is there a difference to a reason why you used ## instead of #? – Sofia Sep 08 '22 at 11:08
  • Also - I did what you suggested and try to run the first line only. I tried to use different names, to not only copy yours but to make sure I understand what's going on. It did throw an error though in the first line, even when I ran the first two lines. This is my code: `wohnung2 <- c('ost', 'west', 'sued', 'ost.west', 'sued.west', 'sued.ost', 'gesamtes haus') ue %>% #get the days in each booking interval mutate(days = map2(arr_date, dep_date, .f = seq.Date, by = '1 day')) %>% #get rid of unneeded columns select(all_of(wohnung2), days)` – Sofia Sep 08 '22 at 11:13
  • and this it the error: Error in `mutate()`: ! Problem while computing `days = map2(arr_date, dep_date, .f = seq.Date, by = "1 day")`. It says there's a wrong sign with the "by" argument". – Sofia Sep 08 '22 at 11:13
  • No real differences between `<-` and `=` (you can read about the slight differences at [this FAQ](https://stackoverflow.com/q/1741820/903061). As for the comment characters, on Stack Overflow I tend to show output with `#` and use `##` for actual comments. – Gregor Thomas Sep 08 '22 at 13:31
  • That error that you got is what would happen if you had an `arr_date` *after* a `dep_date`. Which seems like it shouldn't happen in your data. So maybe check your data and correct errors where the arrival is after the departure. – Gregor Thomas Sep 08 '22 at 13:32