Star schema in Python Pandas

Question

I currently have a project where I extract the data from a Firebird database and do the ETL process with Knime, then the CSV files are imported into PowerBI, where I create table relationships and develop the measures. With Knime I summarize several tables, denormalizing. I would like to migrate to Python completely, I am learning Pandas.

I would like to know how to deal with relational modeling in Python, star schema for example. In PowerBI there is a section dedicated to it where I establish relationships, indicating if they are uni or bi directional.

The only thing I can think of so far is to work in Pandas with joins in every required situation / function, but it seems to me that there must be a better way.

I would be grateful if you would indicate that I should learn to face this.

What specifically are you trying to replace with Python? Are you replacing your ETL processes with it and moving your data to where it can be pulled into PowerBI? Or are you replacing everything with Python, including your dashboards? — LTheriault, Apr 08 '20 at 15:30
Everything, regarding extraction and transformation I have understood what I should do, thanks to the documentation of pandas, but so far what I do not understand is if I should recreate the star schema model in some way in python, or what is the approach in this regard . — Santiago Szelenko, Apr 08 '20 at 18:22

score 3 · Answer 1 · answered Apr 08 '20 at 19:25

I think I can answer your question now that I have a better understanding of what you're trying to do in Python. My stack for reporting also involves Python for ETL operations and Power BI for the front end, so this is how I approach it even if there may be other ways that I'm not aware of.

While I create actual connections in Power BI for the data model I am using, we don't actually need to tell Python anything in advance. Power BI is declarative. You build the visualizations by specifying what information you want related and Power BI will do the required operations on the backend to get that data. However, you need to give it some minimal information in order to do this. So, you communicate the way you want the data modeled to the Power BI.

Python, in contrast, is imperative. Instead of telling it what you want at the end, you tell it what instructions you want it to perform. This means that you have to give all of the instructions yourself and that you need to know the data model.

So, the simple answer is that you don't need to deal with relational modeling. The more complicated and correct answer is that you need to plan your ETL tools around a logical data model. The logical data model doesn't really exist in one physical space like how Power BI stores what you tell it. It basically comes down to you knowing how the tables are supposed to relate and ensuring that the data stored within them allows those transformations to take place.

When the time comes to join tables in Python, perform join operations as needed, using the proper functions (i.e. merge()) in combination with the logical data model you have in your head (or written down).

The link I'm including here is a good place to start research/learning on how to think about data modeling on the more conceptual level you will need to: https://www.guru99.com/data-modelling-conceptual-logical.html

If I understand what you're saying correctly, you're advising to use pandas both for the ETL step and the view creation step, so that PowerBI simply reads a flat file without having to carry out any view definition logic. — dkritz, Apr 08 '20 at 19:47
It depends on the case. If you know what "views" you are going to need and know that they are going to be static, handling them purely in pandas is the best bet in my view. If you need your data to stay in separate tables to handle a variety of views, then modeling in Power BI works. But yes, generally, the first one works better for me because it allows you more precise control over how your data relates instead of leaving it to Power BI. — LTheriault, Apr 08 '20 at 20:46
The important takeaway is that you should take a step back from thinking of data models in the way that Power BI has you think about it. That way is merely an instance of the logical data model that you have in your head/documentation/etc. that is more of an idea than something that needs to always be physically specified in its entirety. — LTheriault, Apr 08 '20 at 20:48
@LTheriault, don't you think that the reason is that you deal with data modeling in power bi is because it's 100% of the time a living dashboard with a SQL Server running behind, and a star schema with fact table + dimension tables are the fastest way to build dashboards (BI purposes) while in python you ingest CSVs and tabular data mostly in memory to render a certain task and then kill the python process (while in power bi you are slicing and dicing the data in real time while exploring the dashboard)? — Pabluez, Mar 28 '23 at 02:00

Star schema in Python Pandas

1 Answers1