0

Background:

This question is perhaps a bit broad, but hopefully interesting to anyone using relational data, R, Power BI or all of the above.

I'm trying to recreate a relational model for the dataset nycflights13 described in the book R for Data Science by Wickham and Grolemund. And I'm trying to do so using both R and Power BI. The dataset consists of the 5 tables airlines, ariports, flights, weather and planes. In section 13.2 nycflights13 theres a passage that states:

flights connects to weather via origin (the location), and year, month, day and hour (the time).

The relationships are illustrated by this figure:

enter image description here

Question 1: How can I set up this model in Power BI?

Using the following R script will make the datasets available for Power BI in the folder c:/data:

# install.packages("tidyverse")
# install.packages("nycflights13")

library(tidyverse)
library(nycflights13)
setwd("C:/data/")
#getwd()

airlines
df_airlines <- data.frame(airlines)
df_airports <- data.frame(airports)
df_planes <- data.frame(planes)
df_weather <- data.frame(weather)
df_flights <- data.frame(flights)

write.csv(df_airlines, file = "C:/data/airlines.txt", row.names = FALSE)
write.csv(df_airports, file = "C:/data/airports.txt", row.names = FALSE)
write.csv(df_planes, file = "C:/data/planes.txt", row.names = FALSE)
write.csv(df_weather, file = "C:/data/weather.txt", row.names = FALSE)
write.csv(df_flights, file = "C:/data/flights.txt", row.names = FALSE)

Having imported the tables in Power BI, I'm trying to establish the relations in the Relationships tab:

enter image description here

And I'm able to do so to some extent, but when I try to connect flights to weather using for example year, I'm getting the following error message:

You can't create a relationship between these two columsn because one of the columns must have unique values.

And I understand that this happens because primary keys must contain unique values and cannot contain null values. But how can you establish a primary key in power BI that consists of multiple fields?

Question 2: If there's no answer to question 1, how can you do this in R instead?

I really love this book, and It may even be described there already, but how do you establish a relationship like this in R? Or perhaps you don't need to since you can join on multiple columns or composite key using dplyr without there being 'established' a relatinship at all?

Put another way, are the relationship illustrated by the figure aboe with the arrows:

enter image description here

and in Power BI with the lines:

enter image description here

really not necessary in R as long as you have the required verbs and there actually does exist a relatinship between the data in the different tables?

Question 3 - Why is flight highlighted in the flights table:

I thought that a highlighted column name indicated that there had been established a connection between tables using that column. But as far as I can tell, that is not the case here, and there is no arrow pointing to it:

enter image description here

Does it perhaps indicate that it is a primary key in the flights table without any connection to another table?


I know this is a bit broad, but I'm really curious about these things so I'm hoping some of you will find it interesting!

BJ Myers
  • 6,617
  • 6
  • 34
  • 50
vestland
  • 55,229
  • 37
  • 187
  • 305
  • 1
    Hi. As I just said on [an earlier question](https://stackoverflow.com/questions/52861280/what-is-happening-under-the-hood-when-a-relationship-is-established-between-tabl): FKs & other constraints are not needed to query. They tell the DBMS to disallow invalid states. They suggest indexing for optimization. Some tools use them to automatically generate parts of queries because queries frequently use the columns. PS Please ask one question per post. – philipxy Oct 20 '18 at 00:47
  • 1
    Power BI Developer doesn't actually have a concept of CK/PK/FK. Its "relationships" are for building queries. Which do often exist where FKs do including from query results to base tables. See [Create and manage relationships in Power BI Desktop](https://learn.microsoft.com/en-us/power-bi/desktop-create-and-manage-relationships#understanding-relationships) & its Download PDF link for Develper. So your question doesn't really make sense. Please edit it to clearly distinguish between FKs (aka pseudo-ER "relationships") & BI "relationships". PS What generated the oval/arrow diagram? – philipxy Oct 21 '18 at 03:44

1 Answers1

1

I can comment on Power BI part.

The key issue here is that Power BI requires Dimensional Model, not relational one. There is a huge difference.

As described, the model from the book is not suitable for BI tools, it must be redesigned. For example, table "Weather" in the book is presented as a "dimension", while in reality it must be a fact table (similar to table "Flights"). As a result, "Flights" and "Weather" should never have direct connections - they must share common dimensions, such as:

  • Airport
  • Airline
  • Plane
  • Date
  • Time

Similarly, multiple keys and multiple connections between tables are very rare exceptions and are frowned upon (usually, they are indications of design mistakes). In a properly designed model, you should never see them.

If you want to understand the issue more, read this book: Star Schema Complete Reference

To answer your Q3 specifically, in dimensional modeling "Flight" (I assume it's flight number) is called a "degenerate dimension". Normally, it would have been a key to a dimension table, but if it's absent, it stays in a fact table as an orphan key. Such situation is common for Order numbers, Invoice numbers, etc.

Degenerate dimensions

Overall, you are on the right track - if you figure out how to transform the model from the book into a proper star schema, and then use it in R and PowerBI, you will be impressed with the new capabilities - it's worth it.

RADO
  • 7,733
  • 3
  • 19
  • 33