Highest Voted 'data-lake' Questions

17

votes

7 answers

Hadoop Vs Data Lake

I came across a new term, Data Lake. I googled it and found that a data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually…

hadoop data-warehouse data-lake

asked Mar 14 '16 at 12:24

Kishore

5,761
5
28
53

13

votes

1 answer

lakeFS, Hudi, Delta Lake merge and merge conflicts

I'm reading the lakeFS documentation and don't yet clearly understand what a merge, or even a merge conflict, means in terms of lakeFS. Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support…

delta-lake data-lake apache-hudi lakefs data-lakehouse

asked Oct 03 '21 at 17:34

alexanoid

24,051
54
210
410

11

votes

5 answers

Are Data Lake and Big Data the same?

I am trying to understand whether there is a real difference between a data lake and big data. If you check the concepts, both are like a big repository that stores information until it becomes necessary. So, when can we say that we are using big…

bigdata data-lake

asked Sep 18 '18 at 15:30

user3342209

133
1
7

8

votes

2 answers

AWS Glue Data Catalog as Metastore for external services like Databricks

Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog. We can then easily use Athena, Redshift or EMR to query data on S3 using Glue as the metastore. My question is: is it possible to expose the Glue Data Catalog as a metastore for…

amazon-s3 databricks aws-glue data-lake hive-metastore

asked Apr 16 '18 at 02:36

Obaid

237
2
14

7

votes

2 answers

On-premise delta lake

Is it possible to implement a Delta Lake on-premises? If yes, what software/tools need to be installed? I'm trying to implement a Delta Lake on-premises to analyze some log files and database tables. My current machine is loaded with Ubuntu, Apache…

delta-lake data-lake

asked Feb 09 '21 at 19:36

Ajoy

113
1
1
10

6

votes

1 answer

Database vs DataMart vs Data Warehouse vs Data Lake

Looking for the high-level differences/comparison among Database, Data Mart (Top-down approach), Data Warehouse, Data Lake, and Data Lakehouse. Please use relative comparison when specifics are not available.

database comparison data-warehouse data-lake datamart

asked May 12 '20 at 12:23

Ashok Goli

5,043
8
38
68

6

votes

2 answers

Is DynamoDB suitable as an S3 Metadata index?

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data, and DynamoDB is used to index it and provide metadata. This is an…

amazon-s3 amazon-dynamodb data-lake

asked Nov 10 '16 at 15:05

Alex Spurling

54,094
23
70
76

5

votes

3 answers

Data Governance solution for Databricks, Synapse and ADLS gen2

I'm new to data governance, so forgive me if the question lacks some information. Objective: We're building a data lake and enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS Gen2, Databricks and Synapse for…

azure architecture databricks data-lake azure-data-catalog

asked May 11 '20 at 22:20

VB_

45,112
42
145
293

4

votes

2 answers

AWS Glue Job : An error occurred while calling getCatalogSource. None.get

I was using a username/password in my AWS Glue connections and have now switched to Secrets Manager. Now I get this error when I run my ETL job: An error occurred while calling o89.getCatalogSource. None.get Even though the connections and crawlers work…

python amazon-web-services aws-glue aws-glue-data-catalog data-lake

asked Sep 19 '22 at 19:43

Brahim BEN ADDI

41
1
3

4

votes

1 answer

Flatten JSON with array using AWS Glue crawler / classifier / ETL job

I'm crawling the following JSON file (it's valid JSON) from an S3 data lake. Inside there are 2 fields (device, timestamp) and an array of objects called "data". The objects in the data array differ from one another. { "device": "0013374838793C8", …

json amazon-web-services amazon-athena aws-glue data-lake

asked Mar 19 '19 at 11:47

Maciej Malak

96
1
8

3

votes

2 answers

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

Problem Statement/Root Cause: We are using AWS Glue to load data from a production PostgreSQL DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is, however, failing as Spark only supports lowercase table column…

pyspark aws-glue aws-glue-data-catalog data-lake

asked Sep 25 '19 at 07:20

rajmohan k

41
1
5

3

votes

1 answer

Streaming data from Aurora to S3 for Data Lake

I am trying to create a Data Lake using S3, where data is coming from Aurora and eventually other sources; however, I am having trouble creating a cost-efficient solution. I have been looking into using Database Migration Service (DMS) to stream…

amazon-web-services amazon-s3 streaming amazon-aurora data-lake

asked Oct 01 '18 at 17:09

Alex Oh

411
1
4
6

3

votes

1 answer

AWS Data Lake Dynamo vs ElasticSearch

I am really struggling to understand how Dynamo / ElasticSearch should be used to support AWS data lake efforts (Metadata / Catalogs). It seems as though you would log the individual S3 locations of your zip archives for your sources in Dynamo and…

amazon-web-services elasticsearch amazon-s3 amazon-dynamodb data-lake

asked Oct 09 '17 at 18:38

scarpacci

8,957
16
79
144

3

votes

2 answers

Metadata management for (Azure) data-lake

To my understanding, the data-lake solution is used for storing everything from raw data in the original format to processed data. I have not been able to understand the concept of metadata management in the (Azure) data-lake though. What are…

azure metadata azure-data-lake database-metadata data-lake

asked Mar 27 '17 at 06:08

AlexGuevara

932
11
28

3

votes

2 answers

Does the ROWCOUNT hint work for EXTRACT in U-SQL?

I want to allocate more vertexes to the extraction job. I tried using the ROWCOUNT hint, but it doesn't seem to work: no matter what value I use for ROWCOUNT, U-SQL always allocates the same number of vertexes. EXTRACT xxxx FROM @"Path" USING new…

azure-data-lake u-sql data-lake

asked Mar 07 '17 at 21:30

lidong

556
1
4
20
