0

I am trying to combine 3 separate data frames together (2018, 2019 and 2020 files) to make one time series graph from January - December but so far I can only make one long graph, January 2018 to June 2020. I would like to see a line graph with each line representing one year. Previews of each of the 3 years of data is pasted below. It is hard to tell from below but Month, Day and Hour are each in individual columns.

I've tried using rbind, merge, and several other functions but can't quite get it. Thanks for your help!

2020 DATA
Site        Parameter           Date (LT)       Year    MonthDayHourRaw Conc.
Kathmandu   PM2.5 - Principal   1/1/2020 1:00   2020    1   1   1   171
Kathmandu   PM2.5 - Principal   1/1/2020 2:00   2020    1   1   2   161
Kathmandu   PM2.5 - Principal   1/1/2020 3:00   2020    1   1   3   121
Kathmandu   PM2.5 - Principal   1/1/2020 4:00   2020    1   1   4   119
Kathmandu   PM2.5 - Principal   1/1/2020 5:00   2020    1   1   5   137

2019 DATA
Site    Parameter               Date (LT)       Year    MonthDayHourRaw Conc.
Kathmandu   PM2.5 - Principal   1/1/2019 1:00   2020    1   1   1   200
Kathmandu   PM2.5 - Principal   1/1/2019 2:00   2020    1   1   2   185
Kathmandu   PM2.5 - Principal   1/1/2019 3:00   2020    1   1   3   180
Kathmandu   PM2.5 - Principal   1/1/2019 4:00   2020    1   1   4   190
Kathmandu   PM2.5 - Principal   1/1/2019 5:00   2020    1   1   5   200


2018 DATA
Site        Parameter           Date (LT)       Year    MonthDayHourRaw Conc.
Kathmandu   PM2.5 - Principal   1/1/2019 1:00   2020    1   1   1   250
Kathmandu   PM2.5 - Principal   1/1/2019 2:00   2020    1   1   2   215
Kathmandu   PM2.5 - Principal   1/1/2019 3:00   2020    1   1   3   270
Kathmandu   PM2.5 - Principal   1/1/2019 4:00   2020    1   1   4   221
Kathmandu   PM2.5 - Principal   1/1/2019 5:00   2020    1   1   5   250
OTStats
  • 1,820
  • 1
  • 13
  • 22
Les
  • 1
  • 2
  • Welcome to SO! If you run `dput(your_data)` and paste the outputs here it will be easier to help. [Here](https://stackoverflow.com/a/5963610/12400385) is a nice resource on giving a minimal reproducible example. – nniloc Jun 10 '20 at 00:08
  • 1
    Shouldn't the `Year` column in each dataframe be different? Right now they all say 2020? – Chuck P Jun 10 '20 at 03:57
  • what you want is a dataframe where one column is the year and the other column is the `Raw Conc.` value for all entries in that year? – dvd280 Jun 10 '20 at 05:48
  • Are you looking to compare years? In that case, would you want to create a line chart with 3 lines (one for each year) and have the x axis as month and y axis `Raw Conc.`? – chemdork123 Jun 10 '20 at 11:36
  • Thank you for these comments! Yes @chemdork123 I am looking to compare years and create a line chart. Time (date or month) would be the x axis and y axis would be the Raw Conc. I'm just not sure which commands to use to ask R to plot the data by month and day for 3 years on top of one another (instead of just with a very long x axis of month and day in 2018, month and day in 2019 and month and day in 2020. The individual components of the date variable are separated into 3 variables - year, month, and day, so perhaps I should do something there? – Les Jun 10 '20 at 13:47
  • Thank you Chuck P you are correct- I pasted the data above and modified the 2020 data frame to create fake 2019 and fake 2018 data just for ease of reference. But you are correct that the year variable in the 2019 dataframe should be 2019 and the year in the 2018 df should be 2018. Thanks. – Les Jun 10 '20 at 13:48

1 Answers1

1

It would have been helpful if you were able to post your dataframes in a format that is easily reproduced, such as using dput(data.frame). Regardless, I have replicated your dataset in approximation by creating the following individual dataframes, df_2018, df_2019, and df_2020.

df_2018 <- data.frame(Raw.Conc=c(171,161,121,119,137), Date=c('1/5/2018','2/12/2018','3/1/2018','4/4/2018','5/2/2018'))
df_2019 <- data.frame(Raw.Conc=c(200,185,180,190,200), Date=c('1/2/2019','2/12/2019','3/3/2019','4/1/2019','5/6/2019'))
df_2020 <- data.frame(Raw.Conc=c(250,215,270,221,250), Date=c('1/1/2020','2/5/2020','3/4/2020','4/7/2020','5/5/2020'))

Preparation of Data

Each dataframe contains two columns: one that is a numeric vector, $Raw.Conc, and one that is right now a factor, $Date. The first step is that we should combine the datases into one. Critically, we need to maintain information on the origin of each dataset, which I can do first by creating a new column called something like $origin for each, but it will be more straightforward to separate out the data in categories after convering the $Date column to a class of Date using the lubridate package and pulling out specifically the year for each date. See the code below:

# merging datasets
df <- rbind(df_2018, df_2019, df_2020)

# format $Date field
df$Date <- as.Date(df$Date, format='%m/%d/%Y')

# create new column for year
df$Year <- as.character(year(df$Date))

Note that I'm using the year() function from lubridate to create df$Year. The output of year() is numeric, so I covert that to a character vector to ensure that this particular column is treated as discrete and not continuous for our purposes later.

Plot

This can be enough to create the plot, but remember that df$Date is indicating different years, so the lines will not overlap the same space on the x axis, but be separated. Observe the output when we plot df$Date as the x axis, df$Raw.Conc on the y axis, and use df$Year as the color= aesthetic for geom_line():

ggplot(df, aes(Date, Raw.Conc)) + geom_line(aes(color=Year))

enter image description here

That's not what you want, from my understanding. We effectively need to remove the date and just plot month/day. There might be an elegant way to do that... but here I'll hack a solution that basically puts all of your dates in the same year, so that way they will be plotted as if they were in the same year (which is the behavior you want). Note the code below:

# create character vector of "mm-dd"
df$Date.axis <- paste0(month(df$Date),'-',day(df$Date))
# turn that into a class(Date)
df$Date.axis <- as.Date(df$Date.axis, format='%m-%d')

If you check df$Date.axis, you'll see it included the current year (2020) for all of them, but it will work for our purposes. We can now plot, but use scale_x_date() to control labeling of the x axis so that we only include the month (and ignore the year).

ggplot(df, aes(x=Date.axis, y=Raw.Conc)) + geom_line(aes(color=Year)) +
  scale_x_date(date_labels = '%b')

enter image description here

Change aesthetics and labels to get the chart you want, but that's a general approach that should work for what you are looking to do.

chemdork123
  • 12,369
  • 2
  • 16
  • 32
  • 1
    Thank you chemdork123! Really appreciate your advice on this!! Have a great evening. – Les Jun 10 '20 at 22:15