Questions tagged [aws-databricks]

For questions about the usage of the Databricks Lakehouse Platform on the AWS cloud.

Databricks Lakehouse Platform on AWS

The Databricks Lakehouse Platform accelerates innovation across data science, data engineering, business analytics, and data warehousing, integrated with your AWS infrastructure.

Reference: https://databricks.com/aws

190 questions
18 votes • 1 answer

Local instance of Databricks for development

I am currently working on a small team that is developing a Databricks-based solution. For now we are small enough to work off of cloud instances of Databricks. As the group grows this will not really be practical. Is there a "local" install of…
John • 3,458
12 votes • 5 answers

How to access shared Google Drive files through Python?

I am trying to access shared Google Drive files through Python. I have created an OAuth 2.0 client ID as well as the OAuth consent screen. I have copy-pasted this code:…
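
A minimal sketch of the usual approach with google-api-python-client, assuming a credentials.json client-secrets file downloaded for the OAuth client ID (the file name and scope are illustrative, not from the question):

from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

# Run the OAuth consent flow; "credentials.json" is an assumed file name.
flow = InstalledAppFlow.from_client_secrets_file(
    "credentials.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
creds = flow.run_local_server(port=0)

# List files, including items shared from other drives.
service = build("drive", "v3", credentials=creds)
resp = service.files().list(
    supportsAllDrives=True,
    includeItemsFromAllDrives=True,
    fields="files(id, name)",
).execute()
for f in resp.get("files", []):
    print(f["name"], f["id"])
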
7 votes • 2 answers

Run Databricks job from notebook

I want to know if it is possible to run a Databricks job from a notebook using code, and how to do it. I have a job with multiple tasks and many contributors, and we have a job created to execute it all. Now we want to run the job from a notebook to…
Joe • 561
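
One common pattern (a sketch, not necessarily what the answers propose) is to call the Jobs REST API's run-now endpoint from the notebook; the host, token, and job_id below are placeholders:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # better fetched via dbutils.secrets.get

resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123, "notebook_params": {"name": "john doe"}},
)
resp.raise_for_status()
print(resp.json()["run_id"])  # id of the run that was just triggered
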
7 votes • 3 answers

Execute multiple notebooks in parallel in PySpark Databricks

The question is simple: master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in Databricks PySpark? The image below explains what I am trying to do; it errors out for some reason. Am I missing something here?
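
A sketch of one standard workaround: run the child notebooks concurrently from the driver notebook with a thread pool (the paths and timeout are assumptions; dbutils is implicitly available in Databricks notebooks):

from concurrent.futures import ThreadPoolExecutor

notebooks = ["./dim_1", "./dim_2"]  # hypothetical notebook paths

def run(path):
    # dbutils.notebook.run blocks until the child notebook returns
    return dbutils.notebook.run(path, timeout_seconds=3600)

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run, notebooks))

print(results)  # exit values returned by dim_1 and dim_2
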
6 votes • 2 answers

How do we access Databricks job parameters inside the attached notebook?

In Databricks, if I have a job request JSON such as: { "job_id": 1, "notebook_params": { "name": "john doe", "age": "35" } }, how do I access the notebook_params inside the job's attached notebook?
Sannix19 • 75
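
For reference, parameters passed through notebook_params surface as notebook widgets, so a sketch of reading them looks like this (all widget values arrive as strings):

# In the notebook attached to the job:
name = dbutils.widgets.get("name")  # "john doe"
age = dbutils.widgets.get("age")    # "35" (widget values are strings)
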
5 votes • 0 answers

AWS Key issue while working with Databricks Mount

Currently I am facing an issue while dealing with a Databricks mount point created on top of an AWS S3 bucket. I could create the mount point in a Databricks notebook with the code below: ACCESS_KEY = "<>" SECRET_KEY =…
Abhi • 341
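
For context, the usual key-based mount looks roughly like the sketch below; the bucket name and mount point are illustrative, and note that the secret key must be URL-encoded:

from urllib.parse import quote

ACCESS_KEY = "<access-key>"   # placeholder
SECRET_KEY = "<secret-key>"   # placeholder
ENCODED_SECRET = quote(SECRET_KEY, safe="")  # a "/" in the secret breaks the URI

dbutils.fs.mount(
    source=f"s3a://{ACCESS_KEY}:{ENCODED_SECRET}@my-bucket",  # assumed bucket
    mount_point="/mnt/my-bucket",
)
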
4 votes • 0 answers

Delta Live CDC for Aggregate State Tables

As far as I can tell from the documentation, I cannot accomplish a specific migration from Delta to Delta Live that I would love to do, but I want to see if I might be missing a solution. Currently, I have a number of aggregate batch Delta tables…
4 votes • 0 answers

How to clean up extremely large delta log checkpoints and many small files?

AWS, by the way, if that matters. We have an old production table that has been running in the background for a couple of years, always with auto-optimize and auto-compaction turned off. Since then, it has written many small files (like 10,000 an…
Fenno Vermeij • 126
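
The usual remedy is to compact with OPTIMIZE and then clean up unreferenced files with VACUUM; a sketch, where the table name and retention window are assumptions and VACUUM permanently deletes files:

# Compact many small files into fewer, larger ones.
spark.sql("OPTIMIZE my_schema.my_table")

# Remove files no longer referenced by the Delta log
# (default retention threshold is 7 days = 168 hours).
spark.sql("VACUUM my_schema.my_table RETAIN 168 HOURS")
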
4 votes • 2 answers

Get a list of files in S3 using PySpark in Databricks

I'm trying to generate a list of all S3 files in a bucket/folder. There are usually on the order of millions of files in the folder. I use boto right now and it's able to retrieve around 33k files per minute, which for even a million files,…
CodingInCircles • 2,565
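
A sketch of the paginated listing with boto3 (bucket and prefix are placeholders); for very large listings, parallelizing over several prefixes is the usual next step:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

keys = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="my/prefix/"):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

print(len(keys))
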
4 votes • 0 answers

Databricks Stream to Batch process

I am using Databricks and I am enjoying the Auto Loader feature. Basically, it creates the infrastructure to consume data in a micro-batch fashion. It works nicely for the initial raw table (or name it bronze). Where I am a bit lost is how to append my other…
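
For reference, a minimal Auto Loader ingestion into a bronze Delta table looks like the sketch below (paths and source format are assumptions):

# Incrementally ingest new files from S3 into a bronze Delta table.
bronze = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")        # assumed source format
          .option("cloudFiles.schemaLocation", "/mnt/schemas/bronze")  # assumed
          .load("s3://my-bucket/raw/"))               # assumed path

(bronze.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/bronze")  # assumed
       .start("/mnt/delta/bronze"))
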
4 votes • 1 answer

Why does Databricks only plot 1000 rows?

Is there any way in Databricks to plot more than 1000 rows with the built-in visualization? I tried using the limit() function, but it still shows only the first 1000.
JAdel • 1,309
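
display() caps the rendered result at 1,000 rows; a common workaround (a sketch with hypothetical column names) is to collect the data to pandas and plot it directly, which is only sensible when it fits on the driver:

import matplotlib.pyplot as plt

pdf = df.toPandas()  # df is the Spark DataFrame being visualized
pdf.plot(x="timestamp", y="value")  # hypothetical columns
plt.show()
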
4 votes • 1 answer

Databricks Magic SQL - Export Data

Is it possible to export the output of a "magic SQL" command cell in Databricks? I like the fact that one doesn't have to escape the SQL command and it can be easily formatted. But I can't seem to use the output in other cells. What I…
Raj Rao • 8,872
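
One workaround sketch: issue the same query with spark.sql() in a Python cell, so the result is a DataFrame that later cells can reuse or export (the table name and output path are placeholders):

df = spark.sql("SELECT * FROM my_table")  # same query, without the %sql magic
df.write.mode("overwrite").csv("/mnt/exports/my_table")  # hypothetical path

Recent Databricks runtimes also expose the previous %sql cell's result to Python as an implicit DataFrame, but the spark.sql route works everywhere.
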
4 votes • 2 answers

Calling Trigger once in Databricks to process Kinesis Stream

I am looking for a way to trigger my Databricks notebook once to process a Kinesis stream, using the following pattern: import org.apache.spark.sql.streaming.Trigger // Load your streaming DataFrame val sdf =…
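
The PySpark equivalent of that trigger-once pattern is sketched below, with placeholder stream options and paths:

sdf = (spark.readStream
       .format("kinesis")                    # Databricks Kinesis source
       .option("streamName", "my-stream")    # placeholder
       .option("region", "us-east-1")        # placeholder
       .load())

(sdf.writeStream
    .trigger(once=True)  # drain the available data, then stop the query
    .option("checkpointLocation", "/mnt/checkpoints/kinesis")  # placeholder
    .start("/mnt/delta/kinesis_bronze"))     # placeholder sink path
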
3 votes • 1 answer

Databricks Repo vs Workspace

I noticed that in Databricks there is a folder section for 'Workspace' and a folder for 'Repos', as seen below. I have been trying to research online what the difference is, but have had no luck. It seems as though they serve the same purpose? I am able…
3 votes • 1 answer

Using an expression in a PARTITIONED BY definition in Delta Table

Attempting to load data into Databricks using COPY INTO, I have data in storage (as CSV files) with the following schema: event_time TIMESTAMP, aws_region STRING, event_id STRING, event_name STRING. I wish for the target table to be partitioned…
Yuval Itzchakov • 146,575
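
Delta does not allow expressions directly in PARTITIONED BY, but a generated column can carry the expression and serve as the partition key; a sketch using the schema from the question (the table name is an assumption):

spark.sql("""
    CREATE TABLE events (
        event_time TIMESTAMP,
        aws_region STRING,
        event_id   STRING,
        event_name STRING,
        event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")
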