Filtering using R to a specific date range

Question

I have a set of categorical variables listed by date. The desired outcome is a plot of counts of the categorical variables for a particular date range. I can produce a plot of the entire set, but none of the variations I have found (or that people have suggested) produces that outcome. Date is formatted as a date and libloc is a character. The end result I want is a plot of the number of instructions we do in different locations by semester. I understand this is an unimportant/uninteresting question to most of you, but I am a 62-year-old classics librarian stuck at home because of the pandemic, having to learn to program so I can keep my job, so can people please be kind. I realize I am not phrasing my question the way you might want, but I am doing the best I can.

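library(ggplot2)   # the package is called ggplot2, not ggplot
library(lubridate)
library(readxl)    # read_excel() comes from readxl, not readr
library(dplyr)     # provides %>%, select(), filter(), group_by(), summarise()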

df <- read_excel("C:/Users/12083/Desktop/instructions/datasetd.xlsx")
df %>%
  select(date,Location) %>%
  filter(date >= as.Date("2017-01-05") & date <= as.Date("2018-01-10"))%>%
  group_by(Location) %>%
  summarise(count=n())
g <- ggplot(df, aes(Location))
g + geom_bar()

Hi Karl. Not sure if your commands are in the order you're running them, but you're re-initialising `df` from `log` after you've filtered by date: `df <- log %>%`. Is that intentional? — Hobo, Jul 16 '20 at 03:52
Hi Karl, good on you for trying to learn something new. It will be much easier to help if you provide at least a sample of your data with `dput(df)` or if your data is very large `dput(df[1:20,])`. You can [edit] your question and paste the output. Please surround the output with three backticks (```) for better formatting. See [How to make a reproducible example](https://stackoverflow.com/questions/5963269/) for more info. — Ian Campbell, Jul 16 '20 at 04:14
I don't have any choice in learning this. I have added the dput as requested. I also tried this, which didn't work: `df1 %>% select(date, Location) %>% arrange(date) %>% filter(date >= 2014-08-06 & date =< 2014-08-30) summarise(Location)` — Karl Bridges, Jul 16 '20 at 05:00
Thanks for adding the data. It looks like it's after you've summarised it though. Can you add the results of `dput(head(log, 20))` to show us some of the data with the dates? — Hobo, Jul 16 '20 at 06:13
`Class ID` `Department/Col~ `Course Level` `Course Title` `Tour?` `TILT?` date `Session Number` `AM/PM` 1 4438 College of Arts~ Lower Division ACAD 1111 FALSE FALSE 2016-07-20 Third Session AM 2 4439 College of Arts~ Lower Division ACAD 1111 FALSE FALSE 2016-07-20 Third Session PM 3 4428 College of Arts~ Lower Division POLS 1110 FALSE FALSE 2 — Karl Bridges, Jul 16 '20 at 06:36
with 4,340 more rows, and 30 more variables: `Hour Count` , `Library Instructor` , `Other Library # Instructor` , `Duplicate?` , `Course Instructor` , ACRL , IPED , Location , # `Building/Room` , `Distance Class?` , `Location of Site 1` , `Site 1 Number of Students` , — Karl Bridges, Jul 16 '20 at 06:37
Thanks Karl. Can you just run the `dput()` command from my previous comment to get the first few rows in a format we can use, and add it to the question (same as you did for Ian)? It'll make it easier to run your code — Hobo, Jul 16 '20 at 07:39
Unfortunately dput only gives me a lot of garbage, line after line that says NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA. I loaded all this into github https://github.com/karl1776/banner/ It's 2 am, I gotta quit. Thanks all — Karl Bridges, Jul 16 '20 at 07:59

anakar · Answer 1 · 2020-07-20T13:36:43.270

Hope this helps:

#### Filtering using R to a specific date range ####
# From: https://stackoverflow.com/questions/62926802/filtering-using-r-to-a-specific-date-range

# First, I downloaded a sample dataset with dates and categorical data from here: 
# https://vincentarelbundock.github.io/Rdatasets/datasets.html
# Specifically, I got weather.csv


setwd("F:/Home Office/R")

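# Since R 4.0.0 read.csv() no longer creates factors by default;
# stringsAsFactors = TRUE keeps city as a factor so summary(data$city) shows counts
data = read.csv("weather.csv", stringsAsFactors = TRUE) # Read the data into R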
head(data)                     # Quality control, looks good
data = data[,2:3]              # For this example, I cut it to only take the relevant columns
data$date = as.Date(data$date) # This formats the date as dates for R
library(tidyverse)             # This will import some functions that you need, specifically %>% and ggplot

# Step 0: check that the data makes sense to you
summary(data$date)
summary(data$city)

# Step 1: filter the right data
filtered = data %>% 
  filter(date > as.Date("2016-07-01") & date < as.Date("2017-07-01")) # This will only take rows between those dates

# Step 2: Plot the filtered data
# Using a bar plot: 
plot = ggplot(filtered, aes(x=city, fill = city)) + geom_bar() # You don't really need the fill, but I like it
plot

# Quality control: look at the numbers before and after the filtering:
summary(data$city)
summary(filtered$city)

Outputs:

> summary(data$city)
 Auckland   Beijing   Chicago    Mumbai San Diego 
      731       731       731       731       731 
> summary(filtered$city)
 Auckland   Beijing   Chicago    Mumbai San Diego 
      364       364       364       364       364

You might be able to make it more elegant... but I think it works well
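Applied to the question's own column names (`date` and `Location`), the same two steps might look like the sketch below. The data frame `df` here is invented purely for illustration. The point it tries to show is that the filtered result is saved under a new name (`in_range`, a hypothetical name) and that this new object, not the original `df`, is handed to `ggplot()`.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the real spreadsheet (assumption: a Date column
# called `date` and a character column called `Location`)
df <- data.frame(
  date = as.Date(c("2016-12-01", "2017-02-10", "2017-03-15",
                   "2017-09-20", "2018-01-09", "2018-03-01")),
  Location = c("Library", "Library", "Classroom",
               "Library", "Classroom", "Online"),
  stringsAsFactors = FALSE
)

# Keep the filtered rows in a new object instead of only printing them
in_range <- df %>%
  filter(date >= as.Date("2017-01-05") & date <= as.Date("2018-01-10"))

# Count rows per location within the date range
counts <- in_range %>% count(Location, name = "count")
print(counts)

# Plot the filtered data, not the original df
p <- ggplot(in_range, aes(x = Location)) + geom_bar()
print(p)
```

In the question's code the pipeline result was printed but never stored, so `ggplot(df, ...)` still saw every row.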

EDIT TO MAKE IT INTO A LINE PLOT

This edit is following your request in the comments:

# Line plot
# The major difference between geom_bar() and geom_line() is that 
# geom_line() requires both X and Y values.
# So first I created a new data frame which has these values:
summarised.data = filtered %>%
  group_by(city) %>%
  tally()

# Now you can create the plot with ggplot:
# Notes: 
# 1. group = 1 is necessary
# 2. I added geom_point() so that each X value gets a point. I think it's easier to read. You can remove this if you like
plot.line = ggplot(summarised.data, aes(x=city, y=n, group = 1)) + geom_line() + geom_point()
plot.line

Outputs:

> summarised.data
# A tibble: 5 x 2
  city          n
  <fct>     <int>
1 Auckland    364
2 Beijing     364
3 Chicago     364
4 Mumbai      364
5 San Diego   364

Thanks. That works but I need to now figure out how to make that into a line graph. I understand there’s some hack with geom smooth I can use. Anyway this gets me half the way so much appreciated — Karl Bridges, Jul 20 '20 at 01:13
Thanks. I guess I am just stupid or something. I was trying to get a line graph of instructions by a date range for a particular city. What this does is just create this strange line graph with city on the X axis and counts on the Y axis. This does help me understand the concepts better so that's something — Karl Bridges, Jul 20 '20 at 14:35
Not stupid, learning :) Try and draw exactly what you want (even on a piece of paper) and upload here again. Doesn't seem so difficult, just... too many options out there! — anakar, Jul 20 '20 at 16:12
See the photo. Seems easy enough - dates on one axis, count on the other. — Karl Bridges, Jul 20 '20 at 16:30

score 0 · Answer 2 · answered Jul 16 '20 at 13:46

Salve!

You might find that my santoku package helps. It can chop dates into intervals:

library(santoku)
library(dplyr)

df_summary <- df %>%
  select(date,Location) %>%
  filter(date >= as.Date("2017-01-05") & date <= as.Date("2018-01-10")) %>%
  mutate(semester = chop(date, as.Date(c("2017-01-05", "2017-01-09")))) %>%
  group_by(Location, semester) %>%
  summarise(count=n())

Obviously you will want to pick your semester dates appropriately.

Then you can plot with something like:

library(ggplot2)
ggplot(df_summary, aes(semester, count)) + geom_col() + facet_wrap(vars(Location)) # column is Location, capital L
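If installing an extra package is not an option, base R's `cut()` can also assign dates to intervals. The semester boundaries and labels below are invented for illustration and would need to be replaced with the real academic calendar.

```r
# Hypothetical semester boundaries; with right = FALSE each interval
# includes its start date and excludes its end date
breaks <- as.Date(c("2017-01-05", "2017-05-15", "2017-08-20", "2018-01-10"))
labels <- c("Spring 2017", "Summer 2017", "Fall 2017")

# A few made-up class dates
dates <- as.Date(c("2017-02-01", "2017-06-10", "2017-10-31", "2018-02-01"))

# Assign each date to a semester; dates outside all intervals become NA
semester <- cut(dates, breaks = breaks, labels = labels, right = FALSE)

data.frame(dates, semester)
```

The resulting `semester` factor could then be used in `group_by()` in the same way as the `chop()` result.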

answered Jul 16 '20 at 13:46 by dash2

Thanks. Unfortunately that blows up my console like a bomb with errors. I like the brevity Error: Problem with `mutate()` input `semester`. x Can't combine `..1` > and `..2` . i Input `semester` is `chop(date, as.Date(c("2004-01-05", "2004-05-10")))`. Run `rlang::last_error()` to see where the error occurred. In addition: Warning messages: 1: In mask$eval_all_filter(dots, env_filter) : Incompatible methods ("Ops.factor", ">=.Date") for ">=" 2: In mask$eval_all_filter(dots, env_filter) : Incompatible methods ("Ops.factor", "<=.Date") for "<=" – Karl Bridges Jul 16 '20 at 19:02
Looks as if your `date` column isn't an R `Date` object. You can convert it with `as.Date()`. – dash2 Jul 23 '20 at 11:31
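The `Ops.factor` warnings in the error above suggest that `date` was read in as a factor, so the comparison with a `Date` cannot work. A minimal sketch of the conversion follows, using made-up values and assuming the dates are stored as year-month-day text; a different layout would need a matching `format` string.

```r
# Made-up example: dates stored as a factor, which is what read.csv()
# produced by default before R 4.0.0
df <- data.frame(
  date = factor(c("2004-01-05", "2004-03-20", "2004-06-01")),
  Location = c("Library", "Classroom", "Library")
)
class(df$date)  # "factor"

# Convert via character so the labels, not the internal integer codes, are parsed
df$date <- as.Date(as.character(df$date), format = "%Y-%m-%d")
class(df$date)  # "Date"

# The date comparison now behaves as intended
in_range <- df[df$date >= as.Date("2004-01-05") & df$date <= as.Date("2004-05-10"), ]
in_range
```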

anakar · Answer 3 · 2020-07-22T07:17:41.383

This is a new answer because the approach is different

#### Filtering using R to a specific date range ####
# From: https://stackoverflow.com/questions/62926802/filtering-using-r-to-a-specific-date-range

# First, the data I took by copy and pasting from here: 
# https://stackoverflow.com/questions/63006201/mapping-sample-data-to-actual-csv-data
# and saved it as bookdata.csv with Excel


setwd("C:/Users/di58lag/Documents/scratchboard/Scratchboard")
data = read.csv("bookdata.csv") # Read the data into R

head(data)                                            # Quality control, looks good
data$dates = as.Date(data$dates, format = "%d/%m/%Y") # This formats the date as dates for R
library(tidyverse)                                    # This will import some functions that you need, specifically %>% and ggplot

# Step 0: check that the data makes sense to you
summary(data$dates)
summary(data$city)

# Step 1: filter the right data
start.date = as.Date("2020-01-02")
end.date   = as.Date("2020-01-04")

filtered = data %>% 
  filter(dates >= start.date & 
         dates <= end.date) # This will only take rows between those dates

# Step 2: Plotting
# Now you can create the plot with ggplot:
# Notes: 
# I added geom_point() so that each X value gets a point. 
# I think it's easier to read. You can remove this if you like
# Also added color, because I like it, feel free to delete

Plot = ggplot(filtered, aes(x=dates, y=classes, group = city)) + geom_line(aes(linetype=city, color = city)) + geom_point(aes(color=city))
Plot

# For a clean version of the plot:
clean.plot = ggplot(filtered, aes(x=dates, y=classes, group = city)) + geom_line(aes(linetype=city))
clean.plot

Outputs: the two plots, Plot and clean.plot (images not reproduced here).

EDIT: ADDED A TABLE FUNCTION!

After reading your comments I think I figured out what you're trying to do. You asked for:

"counts of location of instructors on the vertical and dates on the horizontal."

The problem is that the original data doesn't actually give you the counts, i.e. "how many times a specific location appears on a specific date". Therefore, I had to add another line using the table function to calculate this:

data.table = as.data.frame(table(filtered))

This calculates how many times each combination of date + location appears and stores it in a column called "Freq".

Now you can plot this Freq as the count as follows:

# Step 1.5: Counting the values
data.table = as.data.frame(table(filtered)) # This calculates the frequency of each date+location combination
data.table = data.table %>% filter(Freq>0)  # This is used to cut out any Freq=0 values (you don't want to plot cases where no event occurred)
data.table$dates = as.Date(data.table$dates) # as.Date() must be rerun because table() turns the dates back into factors

#Quality control:
dim(filtered)   # Gives you the size of the dataframe before the counting
dim(data.table) # Gives the size after the counting
summary(data.table) # Will give you a summary of how many values are for each city, what is the date range and what is the Frequency range

# Now you can create the plot with ggplot:
# Notes: 
# I added geom_point() so that each X value gets a point. 
# I think it's easier to read. You can remove this if you like
# Also added color, because I like it, feel free to delete

Plot = ggplot(data.table, aes(x=dates, y=Freq, group = city)) + geom_line(aes(linetype=city, color = city)) + geom_point(aes(color=city))
Plot

# For a clean version of the plot (using the counted data, with Freq on the Y axis):
clean.plot = ggplot(data.table, aes(x=dates, y=Freq, group = city)) + geom_line(aes(linetype=city))
clean.plot

I have a feeling it's not exactly what you wanted because the numbers are quite low (ranging between 1-12 counts), but this is how I understood it.

OUTPUTS:

> summary(data.table) 
          city        dates                 Freq      
 Pocatello  :56   Min.   :2015-01-12   Min.   :1.000  
 Idaho Falls:10   1st Qu.:2015-02-10   1st Qu.:1.000  
 Meridian   : 8   Median :2015-03-04   Median :1.000  
            : 0   Mean   :2015-03-11   Mean   :1.838  
 8          : 0   3rd Qu.:2015-04-06   3rd Qu.:2.000  
 Boise      : 0   Max.   :2015-06-26   Max.   :5.000  
 (Other)    : 0
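The same counts can also be obtained with `dplyr::count()`, which only returns the combinations that actually occur and leaves `dates` as a `Date`, so the two clean-up steps after `table()` are not needed. This is a sketch on invented data, assuming columns named `dates` and `city` as in the answer; it is an alternative technique, not the code the answer used.

```r
library(dplyr)
library(ggplot2)

# Invented records: one row per class taught
filtered <- data.frame(
  dates = as.Date(c("2015-01-12", "2015-01-12", "2015-01-12",
                    "2015-02-10", "2015-02-10", "2015-03-04")),
  city = c("Pocatello", "Pocatello", "Meridian",
           "Pocatello", "Idaho Falls", "Pocatello")
)

# One row per date + city combination that actually occurs
counts <- filtered %>% count(dates, city, name = "Freq")
print(counts)

# One line per city, dates on the x axis, counts on the y axis
p <- ggplot(counts, aes(x = dates, y = Freq, color = city, group = city)) +
  geom_line() +
  geom_point()
print(p)
```

Cities with only one date in the range will show a point but no line, which is why `geom_point()` is kept here.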

This is what I am trying to do, but as soon as I used my own csv file it just gives me a blank graph, which confuses me since there is a classes column there. Why would there be no plot? I really truly appreciate your help and am sorry I am so dumb about this. I put the csv file on github https://github.com/karl1776/chart — Karl Bridges, Jul 21 '20 at 11:12
First, thanks for uploading. It's MUCH easier to help you out this way. Second, what do you want on your Y axis? If I look in your csv there seems to be a column called classes, but under it there are just names, not numbers. Check ```data$classes``` to see it. Maybe there was a bug and what you want to use is under IPED? Try replacing "classes" with "IPED" in your code just to see if it makes sense — anakar, Jul 21 '20 at 12:00
I want dates on the horizontal axis and counts on the vertical axis. The idea here is to take the individual records in the rows, select by date range, and then plot them summarized by date. Classes is actually the location of the instructor, which I want to plot by date and by city with a separate line for each city. This is really a badly thought out data set. They didn’t think when they designed it or organized it. My apologies. It’s one reason this is a pain - not how I would have put this together — Karl Bridges, Jul 21 '20 at 12:44
The plot that you gave is exactly what I am trying to do - counts of location of instructors on the vertical and dates on the horizontal. I am hoping if this works to try and make this sort of the standard plotting we do in my library and elsewhere. So this benefits lots of people - librarians are not good with data visualization - can you tell? Insert smiley face here — Karl Bridges, Jul 21 '20 at 12:51
I reloaded this and changed the classes field name to location. It still makes no difference. It seems to work OK until it gets to the plot function. For some reason it appears the location field, which is a list of cities where the instructors are, doesn't group. So do I need a summarize function to turn that into a numeric? — Karl Bridges, Jul 21 '20 at 14:53
The error is somewhere in the plot. After the plot call I see the following, and I'm not sure why: `Don't know how to automatically pick scale for object of type function. Defaulting to continuous.` `Error: Aesthetics must be valid data columns. Problematic aesthetic(s): y = location. Did you mistype the name of a data column or forget to add after_stat()?` — Karl Bridges, Jul 21 '20 at 15:28
`Plot = ggplot(filtered, aes(x=dates, y=location, group = city)) + geom_line(aes(linetype=city, color = city)) + geom_point(aes(color=city))` followed by `Plot`. I am convinced there is a problem in this line. — Karl Bridges, Jul 21 '20 at 16:40
The error you are getting: ```Problematic aesthetic(s): y = location``` is because the Y axis you defined makes no sense. You put down "location". That's a list of NAMES. it doesn't make sense to plot it that way. The Y axis needs numerical values. — anakar, Jul 22 '20 at 06:24
See my edit for another solution, using the Table function to get the counts from the data. Hope it helps — anakar, Jul 22 '20 at 07:18
