Questions tagged [data-lake]

161 questions
17
votes
7 answers

Hadoop Vs Data Lake

I heard a new term, "data lake". I googled it and found that a data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually…
Kishore
13
votes
1 answer

lakeFS, Hudi, Delta Lake merge and merge conflicts

I'm reading the lakeFS documentation and don't clearly understand what a merge, or even a merge conflict, means in lakeFS terms. Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support…
alexanoid
11
votes
5 answers

Are Data Lake and Big Data the same?

I am trying to understand whether there is a real difference between a data lake and big data. If you check the concepts, both are like a big repository that stores information until it becomes necessary. So, when can we say that we are using big…
user3342209
8
votes
2 answers

AWS Glue Data Catalog as Metastore for external services like Databricks

Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog. So we can easily use Athena, Redshift or EMR to query data on S3 using Glue as the metastore. My question is: is it possible to expose the Glue Data Catalog as a metastore for…
Obaid
7
votes
2 answers

On-premise delta lake

Is it possible to implement a Delta Lake on-premises? If yes, what software/tools need to be installed? I'm trying to implement a Delta Lake on-premises to analyze some log files and database tables. My current machine is loaded with Ubuntu, Apache…
Ajoy
6
votes
1 answer

Database vs DataMart vs Data Warehouse vs Data Lake

Looking for the high-level differences/comparison among: Database, Data Mart (top-down approach), Data Warehouse, Data Lake, Data Lakehouse. Please use relative comparison when specifics are not available.
Ashok Goli
6
votes
2 answers

Is DynamoDB suitable as an S3 Metadata index?

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data, and DynamoDB is used to index it and provide metadata. This is an…
Alex Spurling
5
votes
3 answers

Data Governance solution for Databricks, Synapse and ADLS gen2

I'm new to data governance, so forgive me if the question lacks some information. Objective: We're building a data lake and enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for…
VB_
4
votes
2 answers

AWS Glue Job : An error occurred while calling getCatalogSource. None.get

I was using a username/password in my AWS Glue connections and have now switched to Secrets Manager. Now I get this error when I run my ETL job: An error occurred while calling o89.getCatalogSource. None.get. Even though the connections and crawlers work…
4
votes
1 answer

Flatten JSON with array using AWS Glue crawler / classifier / ETL job

I'm crawling the following JSON file (it's valid JSON) from an S3 data lake. Inside there are 2 fields (device, timestamp) and an array of objects called "data". Each object in the data array differs from one another. { "device": "0013374838793C8", …
3
votes
2 answers

AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes

Problem Statement/Root Cause: We are using AWS Glue to load data from a production PostgreSQL DB into an AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is, however, failing as Spark only supports lowercase table column…
rajmohan k
3
votes
1 answer

Streaming data from Aurora to S3 for Data Lake

I am trying to create a data lake using S3, where data is coming from Aurora and eventually other sources; however, I am having trouble creating a cost-efficient solution. I have been looking into using Database Migration Service (DMS) to stream…
3
votes
1 answer

AWS Data Lake Dynamo vs ElasticSearch

I am really struggling to understand how Dynamo / ElasticSearch should be used to support AWS data lake efforts (Metadata / Catalogs). It seems as though you would log the individual S3 locations of your zip archives for your sources in Dynamo and…
3
votes
2 answers

Metadata management for (Azure) data-lake

To my understanding, the data-lake solution is used for storing everything from raw data in the original format to processed data. I have not been able to understand the concept of metadata management in the (Azure) data lake, though. What are…
3
votes
2 answers

Does the ROWCOUNT hint work for EXTRACT in U-SQL?

I want to allocate more vertices to the extraction job. I tried using the ROWCOUNT hint, but it doesn't seem to work: no matter what value I use for ROWCOUNT, U-SQL always allocates the same number of vertices. EXTRACT xxxx FROM @"Path" USING new…
lidong