Questions tagged [data-warehouse]

A data warehouse (DW) is a database specifically designed to support querying, analyzing and reporting on current and historical data. DWs are central repositories of integrated data from one or more disparate sources. The basic difference between a data warehouse and an ordinary set of database tables is how the data is organized.

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data used for querying, analyzing and reporting for decision-support purposes.

A data mart is the access layer of the data warehouse. It serves a particular department such as Marketing or HR. Because data marts are dedicated to a specific business function or unit, they make information more focused and faster to find.

Some differences between a data mart and a data warehouse:

  • Data warehouses cover multiple subject areas with more detailed information and integrate all sources of data. Dimensional modelling is not required in the warehouse itself, but the warehouse feeds dimensional models.
  • Data marts usually hold one subject area with less detailed, often summarized, information. They concentrate on integrating information from one subject area or source system and are built on dimensional models such as the star schema.
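
As a rough illustration of the distinction above, the sketch below uses Python's built-in `sqlite3` module to hold detailed rows in a warehouse-style table and expose a summarized, single-subject view as a data mart. The table name `warehouse_events`, the view name `marketing_mart`, and all sample rows are hypothetical and chosen only for this example; real warehouses and marts are usually separate stores fed by ETL rather than a single view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Warehouse-style table: detailed rows covering several subject areas.
    CREATE TABLE warehouse_events (
        event_date   TEXT,
        subject_area TEXT,
        region       TEXT,
        amount       REAL
    );
    INSERT INTO warehouse_events VALUES
        ('2024-01-05', 'marketing', 'north', 120.0),
        ('2024-01-09', 'marketing', 'north',  80.0),
        ('2024-01-12', 'marketing', 'south',  50.0),
        ('2024-01-07', 'hr',        'north',  30.0);

    -- Data-mart-style view: one subject area, summarized per region.
    CREATE VIEW marketing_mart AS
        SELECT region, COUNT(*) AS n_events, SUM(amount) AS total_amount
        FROM warehouse_events
        WHERE subject_area = 'marketing'
        GROUP BY region;
""")

mart_rows = conn.execute(
    "SELECT region, n_events, total_amount FROM marketing_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

The mart exposes fewer rows and fewer columns than the warehouse table, which is the "more focused and faster to find" property described above.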

Many readily available products provide data warehousing capability, e.g. MS Access, Essbase (Hyperion, now Oracle), Cognos, Business Objects, MicroStrategy, ...

Basics of Data Warehousing:

  • Dimensional modelling: consists of identifying the measurements, or facts, which are given context by their related dimensions. The grain of the fact table describes the level of detail at which the facts are recorded.

Main steps of dimensional modelling:

  • Choose the business process
  • Declare the grain
  • Identify the dimensions
  • Identify the facts
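
The four steps above can be sketched as a small star schema. This is a minimal example using Python's built-in `sqlite3` module; the business process (retail sales), the grain, and every table and column name (`dim_date`, `dim_product`, `dim_store`, `fact_sales`) are assumptions made up for illustration, not part of the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Step 1, business process: retail sales (hypothetical).
    -- Step 2, grain: one row per product, per store, per day.
    -- Step 3, dimensions: date, product, store.
    CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);

    -- Step 4, facts: numeric measurements recorded at the declared grain.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        quantity    INTEGER,
        revenue     REAL,
        -- The composite key enforces the grain: at most one row per combination.
        PRIMARY KEY (date_key, product_key, store_key)
    );
""")

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Declaring the grain before choosing dimensions and facts matters because the fact table's key follows directly from it; here the composite primary key is one simple way to make the grain explicit.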

  • Online Analytical Processing (OLAP) and its types (ROLAP, MOLAP, HOLAP, ...): describes the basics of the DB designs and the pros and cons of each.
  • Design patterns: a variety of design patterns are used in a data warehouse environment. Common approaches include normalized (5NF); Data Vault; Anchor Modelling; dimensional (5,6); and other temporal designs (e.g. 6NF).
  • SQL: describes how a data warehouse can be queried. Basic keywords that every data warehouse developer should know:
    • JOIN
    • GROUP BY
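
To show how JOIN and GROUP BY typically work together in a warehouse query, here is a minimal sketch using Python's built-in `sqlite3` module. It joins a fact table to one dimension and aggregates a measure per dimension attribute. The tables `fact_sales` and `dim_product` and their sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical dimension and fact tables with a few sample rows.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_key INTEGER, revenue REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'books'), (3, 'games');
    INSERT INTO fact_sales  VALUES (1, 10.0), (2, 5.0), (3, 20.0), (3, 2.5);
""")

# JOIN attaches the descriptive category to each fact row;
# GROUP BY then rolls the revenue measure up to one row per category.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue) AS total_revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```

Most reporting queries against a dimensional model follow this shape: filter or group on dimension attributes, aggregate the fact measures.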

At a high level, data warehousing can be divided into:

  • Using readily available tools (IBM Cognos, Microsoft Business Intelligence, Oracle Business Intelligence Enterprise Edition (OBIEE), Business Objects Enterprise XI, Jaspersoft, Talend Open Studio, Pentaho, QlikView, etc.). Used for small to medium-sized data sets. This usually requires at least knowledge of the tool's:
    • data model and
    • user interface
  • Building your own data warehouse for specific use cases. Used when dealing with really huge data sets (e.g. the data collected by Google, Yahoo or Facebook, or counters/performance-management data from a large telecommunication network). This usually requires at least knowledge of:
    • scalability, high availability and clustering concepts.
    • data warehouse (schema, queries, data model, ...) design.
    • available databases (Oracle, Clustra, Greenplum, MySQL, DB2, ...)
    • problem domain (implicit).
    • relevant GUI/UI (Swing, JSP, ...) and business logic (J2EE, C++, ...) technologies.
2778 questions
190 votes, 11 answers

Difference between Fact table and Dimension table?

What is the difference between fact tables and dimension tables? An example could be very helpful.
user2467545
170 votes, 13 answers

What is the difference between a database and a data warehouse?

What is the difference between a database and a data warehouse? Aren't they the same thing, or at least written in the same thing (ie. Oracle RDBMS)?
Data Man
85 votes, 3 answers

Data Warehouse vs. OLAP Cube?

Can anyone explain what is really distinction between Data Warehouse and OLAP Cubes? Are they different approach for same thing? Is one of them deprecated in comparison with other? Are there any performance issues in one of them? Any explanation is…
veljasije
66 votes, 9 answers

Should OLAP databases be denormalized for read performance?

I always thought that databases should be denormalized for read performance, as it is done for OLAP database design, and not exaggerated much further 3NF for OLTP design. PerformanceDBA in various posts, for ex., in Performance of different…
63 votes, 8 answers

Star-Schema Design

Is a Star-Schema design essential to a data warehouse? Or can you do data warehousing with another design pattern?
S.Lott
51 votes, 7 answers

Data Warehouse Considerations: When and Why?

A little background here: I know what a data warehouse is, more or less. I've read several dozen guides on data warehousing, I've played with SSAS, I know what a star schema and a dimension table and a fact table is, I know what ETL is and how to…
Aaronaught
49 votes, 2 answers

Schema evolution in parquet format

Currently we are using Avro data format in production. Out of several good points using Avro, we know that it is good in schema evolution. Now we are evaluating Parquet format because of its efficiency while reading random columns. So before moving…
ToBeSparkShark
44 votes, 4 answers

NoSql and Data-Warehouse

What are the relations between NoSql and Data-Warehouse technologies/theories? What concepts they share? What are the basic differences between them? How do you think each could be benefits/enriches from the other? I think your ideas should be…
Aito
42 votes, 6 answers

Database choice for large data volume?

I'm about to start a new project which should have a rather large database. The number of tables will not be large (<15), majority of data (99%) will be contained in one big table, which is almost insert/read only (no updates). The estimated amount…
Marko
39 votes, 7 answers

What are the differences between Data Lineage and Data Provenance?

From wiki, Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline…
CSY
39 votes, 4 answers

What are the open source tools and techniques to build a complete data warehouse platform?

I'm looking for these open source tools possibly free or with free trial version to set up complete data warehouse stack. I know about few like Pentaho open source Mondrian server, but couldn't get any google result to setup complete platform. I'm…
understack
38 votes, 3 answers

what is the right data type for unique key in postgresql DB?

which data type should I choose for a unique key (id of a user for example) in postgresql database's table? does bigint is the one? thanks
socksocket
37 votes, 3 answers

What is a staging table?

Are staging tables used only in Data warehouse project or in any SSIS Project? I would like to know what is a staging table? Can anyone give me some examples on how to use it and in what circumstances it is implemented? Also, may I please know the…
Suj
31 votes, 7 answers

20 Billion Rows/Month - Hbase / Hive / Greenplum / What?

I'd like to use your wisdom for picking up the right solution for a data-warehouse system. Here are some details to better understand the problem: Data is organized in a star schema structure with one BIG fact and ~15 dimensions. 20B fact rows…
Haggai
30 votes, 3 answers

Data Warehousing - Star Schema vs Flat Table

I'm trying to design a Data Warehouse for a single store of commonly required data ranging from finance systems, project scheduling systems and a myriad of scientific systems. I.e. many different data marts. I have been reading up on Data…
Calanus