0

The data is from Citi Bikes NYC for January 2019 to December 2019,the data can be viewed here: https://ride.citibikenyc.com/system-data

This is an example of the relevant columns, which are noot part of the original data set but were added by me:

ride_distance_meters ride_distance_mtrs_rnd
1065 1100
578 600
2036 2000
1406 1400
1315 1300
20,551,692 more rows 20,551,692 more rows

when I run the following code I get a histogram that is not smooth:

ridedata_clean %>% count(ride_distance_meters) |> ggplot(aes(x = ride_distance_meters, y= n)) +
geom_histogram(stat='identity', freq = FALSE, color = "#F8766D") +xlim(0,3000) + ylim(0,27000)

enter image description here

I rounded the ride_distance_meters and hoped that I would get a smoother distribution, but that did not happen:

ridedata_clean$ride_distance_mtrs_rnd <- round(ridedata_clean$ride_distance_meters,-2)

ridedata_clean %>% count(ride_distance_mtrs_rnd) |> ggplot(aes(x = ride_distance_mtrs_rnd, y= n)) +
geom_histogram(stat='identity', freq = FALSE, color = "#F8766D") +xlim(0,3000) + ylim(0,27000)

enter image description here

Can anyone tell me what I am doing wrong.

  • Maybe a density plot with geom_density() would fit your needs? – OliverHennhoefer Aug 14 '22 at 11:48
  • Didn't work, it is still not smooth – Madhav Ddas Aug 14 '22 at 12:23
  • @OHennhoefer geom_density() without rounding : https://imgur.com/a/TholXG5, geom_density() with rounding https://imgur.com/sO4deSj – Madhav Ddas Aug 14 '22 at 12:29
  • First, you round with round(x, -2) you should round with round(x, 2). Second, a density plot shows the density function, it is "smooth" by definition since it does not show bars but a line/function. Take a look at http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization to get it right. – OliverHennhoefer Aug 14 '22 at 12:44
  • While geom_density is the better solution, probably, OP could also use the binwidth = and bins = parameters of geom_histogram to make a smoother plot. – John Garland Aug 14 '22 at 13:05
  • @OHennhoefer There doesn't seem to be a problem with too much rounding, see above column ride_distance_mtrs_rnd – Madhav Ddas Aug 14 '22 at 13:07
  • A histogram does not need a `y` aesthetic - try removing it, as well as the `stat="identity"` term. – Andrew Gustar Aug 14 '22 at 14:23
  • 2
    Why `count(ride_distance_mtrs)`? A histogram doesn't need a count, it will do it on its own, only a barplot does. – Rui Barradas Aug 14 '22 at 19:00
  • 2
    `ridedata_clean %>% ggplot(aes(x = ride_distance_mtrs)) + geom_histogram(color = "#F8766D")` As Rui says, a histogram does it's own counting and binning. If you want more or less bins use the `binwidth` or `bins` argument--don't round your data. – Gregor Thomas Aug 14 '22 at 19:54
  • You should post the code that computes `ride_distance_meters` from the original data files. – Rui Barradas Aug 14 '22 at 20:57
  • @RuiBarradas removing count is still giving the same results https://imgur.com/9GLNFo8 – Madhav Ddas Aug 14 '22 at 21:05
  • @GregorThomas Will try to adjust the bin width – Madhav Ddas Aug 14 '22 at 21:06
  • In the page you link to there is a link to download the data files, which data files are you using, the files starting with numbers `2019` or the files starting with `JC-2019`? – Rui Barradas Aug 14 '22 at 21:15
  • @RuiBarradas the files starting with 2019, the ones starting with JC are for Jersey City – Madhav Ddas Aug 15 '22 at 05:53

1 Answers1

2

Here are the histograms. There is no need to count first nor to have an identity stat, geom_histogram will bin the data and count how many data points are in each bin automatically.
The first histogram uses the default number of bins.

library(ggplot2)

ggplot(ridedata_clean, aes(ride_distance_mtrs)) +
  geom_histogram(fill = "#F8766D") +
  ggtitle(label = "Default number of bins: 30") +
  theme_bw()

enter image description here


Now vary the number of bins to see the number of bars increase, therefore making the histograms less smooth.
Base R has functions that return numbers of bins according to different criteria, see the documentation for the functions below here.

nclass.Sturges(dist_mtrs)
# [1] 26
nclass.scott(dist_mtrs)
# [1] 1001
nclass.FD(dist_mtrs)
# [1] 1676

Create a vector of number of bins and loop through it, drawing the histograms and saving them in a list. The y axis limits must be adjusted in the last case. Then plot them in a grid, see for instance this SO question.

# numbers of bins
bins_vec <- c(30, 50, 100, nclass.FD(dist_mtrs))
gg_plots <- vector("list", length(bins_vec))

for(i in seq_along(bins_vec)) {
  main_title <- sprintf("Histogram of distances, %d bins", bins_vec[i])
  if(i == 4)
    main_title <- paste(main_title, "(FD)")
  gg_plots[[i]] <- ggplot(ridedata_clean, aes(ride_distance_mtrs)) +
    geom_histogram(bins = bins_vec[i], fill = "#F8766D") +
    ggtitle(label = main_title) +
    theme_bw()
  if(i == 4)
    gg_plots[[i]] <- gg_plots[[i]] + ylim(0, 1.5e5)
}

gridExtra::grid.arrange(grobs = gg_plots)

enter image description here


Alternatively, the bins can be set with binwidth, to give control on the bins cut points, not on the total number of bins.

# bins widths
binwidths_vec <- c(10, 50, 100, 1000)
gg_plots2 <- vector("list", length(binwidths_vec))

for(i in seq_along(bins_vec)) {
  main_title <- sprintf("Histogram of distances, bin width: %d", binwidths_vec[i])
  gg_plots2[[i]] <- ggplot(ridedata_clean, aes(ride_distance_mtrs)) +
    geom_histogram(binwidth = binwidths_vec[i], fill = "#F8766D") +
    ggtitle(label = main_title) +
    theme_bw()
  if(i == 1)
    gg_plots2[[i]] <- gg_plots2[[i]] + ylim(0, 1.25e5)
}

gridExtra::grid.arrange(grobs = gg_plots2)

enter image description here

There is a visible data artifact, the first bar seems to be due to values near zero. After inspecting the data, I have found that around 2.16% of the distances are equal to zero. I have not determined the file or files those values come from.

sum(ridedata_clean$ride_distance_mtrs == 0)
# [1] 443021

100*mean(ridedata_clean$ride_distance_mtrs == 0)
# [1] 2.155642

Data

Assuming that the files were downloaded to the current directory, the following code was used to transform lon/lat coordinates to distances in meters. This takes a long time with a mid-range year 2022 PC running R 4.2.1 on Windows 11. I have not parallelized the code.

Note that the number of rows 20551697 is equal to the 5+20551692 rows in the question's posted data.

library(readr)
library(geosphere)

read_citibike_file <- function(filename, cols, verbose = TRUE) {
  Y <- readr::read_csv(
    file = filename, 
    col_types = "dddd", 
    col_select = all_of(cols)
  )
  Y <- as.matrix(Y)
  if(verbose) {
    mat_size <- round(utils::object.size(Y)/1024/1024, digits = 1)
    cat("file:", filename, "\tdata size:", mat_size, "Mb\trows:", nrow(Y), "\n")
  }
  Y
}
convert_lonlat_dist <- function(data, start_cols, end_cols) {
  d <- numeric(nrow(data))
  for(i in seq_along(d)) {
    start <- data[i, start_cols, drop = TRUE]
    end <- data[i, end_cols, drop = TRUE]
    tryCatch(
      d[i] <- distm(start, end, fun = distHaversine),
      error = function(e) print(conditionMessage(e))
    )
  }
  d
}

zip_files <- list.files(pattern = "\\.zip$")
for(f in zip_files) unzip(f, exdir = ".")

fls <- list.files(pattern = "\\.csv$")

lon1 <- "start station longitude"
lat1 <- "start station latitude"
lon2 <- "end station longitude"
lat2 <- "end station latitude"
start_cols <- c(lon1, lat1)
end_cols <- c(lon2, lat2)
lonlat_cols <- c(start_cols, end_cols)

dist_mtrs <- sapply(fls, \(x) {
  y <- read_citibike_file(x, cols = lonlat_cols)
  convert_lonlat_dist(y, start_cols, end_cols)
})

dist_mtrs <- unlist(dist_mtrs)
length(dist_mtrs)
# [1] 20551697
ridedata_clean <- data.frame(ride_distance_mtrs = dist_mtrs)
nrow(ridedata_clean)
# [1] 20551697
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66