I am a beginner when it comes to working with external databases in R.

A few months ago I asked how to import a huge dataset into PostgreSQL and got the perfect answer, so I thought I would try again here.

Is there a simple way to do some plots, diagrams or boxplots for external data in R?

Here is my code:

  1. First I connect to the database, do a join and get some mean values, which is slow but works fine.
  2. The problem is with the last bit of code, where I want to plot the years on the x-axis and the price on the y-axis.

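library(DBI)
library(RPostgreSQL)  # provides the "PostgreSQL" driver
library(dplyr)
library(dbplyr)
library(stringr)      # str_detect()
library(lubridate)    # year()
library(dbplot)       # dbplot_line()

db_tankdata <- 'tankdaten'
host_db <- 'localhost'
db_port <- '5432'
db_user <- 'postgres'
db_password <- 'xxx'
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host = host_db,
                 port = db_port, user = db_user, password = db_password)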

#do a join with tables from database, filter a city

ms_stations_comb <- tbl(con, "prices") %>%
  left_join(tbl(con, "stations"), by = c("station_uuid" = "uuid")) %>%
  filter(str_detect(post_code, "^481"))

#get mean prices for different types of fuel

ms_stations_comb %>% summarize(mean_diesel = mean(diesel), mean_e5 = mean(e5), mean_e10 = mean(e10))

#do a plot with years on xlab and price on ylab

ms_stations_comb %>%  dbplot_line(year(date), e5)


The code gives me an error saying:

ERROR: column »dbplyr_016.e5« has to be in the GROUP BY clause or appear in an aggregate function LINE 1: SELECT "year(date)", "e5"

Edit: Basically I want to plot the years on the x-axis and the price on the y-axis. The dataset contains, for example, fuel prices (e5) and dates in the format "2018-04-13 23:17:06".

Thanks in advance!

Gonny
  • `summarize` is meant to be used after setting a grouping variable with `group_by`. If you include some of your data and the desired result in your post, people can give more specific advice. – rpolicastro Jul 22 '20 at 12:10
  • BTW: there is no `year()` function in SQL. (it is a Sybase/Microsoft extension) – wildplasser Jul 22 '20 at 12:32

1 Answer

In general, plotting data in R requires the data to be in R's local memory. If there is too much data to load into R's local memory, then you probably should not be plotting it (I once tried to plot 100M data points and it went badly).

What I recommend is preparing the data in the database and then loading into R's local memory only the data that you need for the plot.

remote_summary = remote_table %>%
  mutate(the_year = YEAR(date)) %>%
  group_by(the_year) %>%
  summarise(e5 = mean(e5))

local_table = collect(remote_summary)

# ggplot or your preferred plotting commands here using local_table

In the above code, we first make a new variable holding the year, then compute the mean e5 value for every year. This produces, inside the database, the summary you want to plot.

collect can then be used to load the remote summary into R's local memory, and the data in the local table can then be plotted.
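As a minimal sketch of that last step, assuming ggplot2 is used for plotting: the data frame below is only a stand-in for the result of collect(remote_summary), with placeholder values rather than real prices, and the axis labels are my own choice.

```r
library(ggplot2)

# Stand-in for local_table <- collect(remote_summary).
# The numbers are placeholders for illustration, not real fuel prices.
local_table <- data.frame(
  the_year = c(2017, 2018, 2019, 2020),
  e5       = c(1.0, 1.1, 1.2, 1.3)
)

# One point per year, joined by a line: years on the x-axis, mean price on the y-axis.
p <- ggplot(local_table, aes(x = the_year, y = e5)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Mean E5 price")

print(p)
```

Because the summary has only one row per year, the amount of data transferred to R stays tiny regardless of how large the prices table is.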

As @wildplasser points out, there is no year() function in PostgreSQL. You probably want DATE_PART instead. Hence your code looks like:

remote_summary = remote_table %>%
  mutate(the_year = DATE_PART('year', date)) %>%
  ...

Because DATE_PART is not an R function, there is no dbplyr translation defined for it, so it is passed directly to PostgreSQL as is. Note that the field name is written as the string 'year': dbplyr treats a bare name such as YEAR as a column reference, which would not give a valid PostgreSQL query. You can check whether the underlying query is correct using show_query. I recommend:

show_query(remote_summary)

before collecting the remote summary. If show_query displays a valid SQL query, then the collect should work. Otherwise, you will need to adjust your definition of remote_summary to get a valid SQL query.
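If you want to inspect the generated SQL without a live connection, one option is dbplyr's simulated backends. The sketch below is an assumption-laden illustration: remote_table here is a hypothetical lazy frame with only the date and e5 columns, not the real joined table. The field name is passed as the string 'year' because dbplyr treats bare names as column references.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for the remote table: a lazy frame with the two
# columns used above, rendered with PostgreSQL translation rules.
remote_table <- lazy_frame(
  date = as.POSIXct("2018-04-13 23:17:06", tz = "UTC"),
  e5   = 1.0,
  con  = simulate_postgres()
)

remote_summary <- remote_table %>%
  # DATE_PART has no dbplyr translation, so it is passed through to SQL;
  # 'year' is a string so that it becomes an SQL string literal.
  mutate(the_year = DATE_PART('year', date)) %>%
  group_by(the_year) %>%
  summarise(e5 = mean(e5, na.rm = TRUE))

# Prints the SQL that would be sent to PostgreSQL; nothing is executed.
show_query(remote_summary)
```

With a real connection, the same pipeline would start from tbl(con, ...) instead of lazy_frame(), and collect(remote_summary) would then run the query.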

Simon.S.A.
  • And there is no way to plot the data if it is too big? To find outliers in the dataset, I really would have to make some boxplots – Gonny Jul 23 '20 at 15:00
  • dbplyr lets R push dplyr-like manipulations into SQL. Making a plot is not something SQL is designed for, hence you cannot push plot generation from R to SQL. – Simon.S.A. Jul 23 '20 at 20:55
  • However, you do not need to load all the data into R in order to produce a box plot. You could calculate the median, upper quartile, lower quartile, min and max in SQL. These values could then be loaded into R and plotted. This would be a much smaller amount of information transferred from the database to R. This question might assist you to do this: https://stackoverflow.com/questions/14316562/nth-percentile-calculations-in-postgresql – Simon.S.A. Jul 23 '20 at 21:01
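
The approach from the last comment could be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for the real data: remote_table is a hypothetical lazy frame standing in for the joined table, local_box holds placeholder values in place of the collected result, and ggplot2 is assumed for plotting. On a PostgreSQL backend, dbplyr should translate quantile() and median() inside summarise() to PERCENTILE_CONT.

```r
library(dplyr)
library(dbplyr)
library(ggplot2)

# Hypothetical stand-in for the remote table, using PostgreSQL translation rules.
remote_table <- lazy_frame(
  date = as.POSIXct("2018-04-13 23:17:06", tz = "UTC"),
  e5   = 1.0,
  con  = simulate_postgres()
)

# Five-number summary per year, to be computed inside the database.
remote_box <- remote_table %>%
  mutate(the_year = DATE_PART('year', date)) %>%
  group_by(the_year) %>%
  summarise(
    ymin   = min(e5, na.rm = TRUE),
    lower  = quantile(e5, 0.25),
    middle = median(e5),
    upper  = quantile(e5, 0.75),
    ymax   = max(e5, na.rm = TRUE)
  )

show_query(remote_box)  # inspect the SQL before collecting

# With a real connection: local_box <- collect(remote_box)
# Placeholder values standing in for the collected result.
local_box <- data.frame(
  the_year = c(2018, 2019),
  ymin = c(1.0, 1.1), lower = c(1.2, 1.3), middle = c(1.3, 1.4),
  upper = c(1.4, 1.5), ymax = c(1.6, 1.7)
)

# stat = "identity" draws the boxes from the precomputed statistics.
p <- ggplot(local_box, aes(x = factor(the_year), ymin = ymin, lower = lower,
                           middle = middle, upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity") +
  labs(x = "Year", y = "E5 price")

print(p)
```

Note that the whiskers here run from the minimum to the maximum, so individual outliers are not drawn as separate points; to look at them, you would filter the extreme rows in the database and collect only those.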