Questions tagged [aws-glue-spark]
244 questions
13
votes
1 answer
How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution
I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty folders that Spark generates "$folder$"…

Lina
- 1,217
- 1
- 15
- 28
8
votes
1 answer
AWS Glue worker pricing details for G.1X and G.2X
I have searched the AWS Glue documentation but could not find the pricing details for AWS Glue worker types G.1X and G.2X. Can someone please explain whether there is a cost difference between Standard, G.1X, and G.2X?
All I can see in the Glue pricing section is…

Yuva
- 2,831
- 7
- 36
- 60
7
votes
1 answer
'Can not create a Path from an empty string' Error for 'CREATE TABLE AS' in hive using S3 path
I am trying to create a table in the Glue catalog with an S3 path location from Spark running in EMR using Hive. I have tried the following commands, but I am getting the error:
pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Can not…

AditiSuba
- 73
- 1
- 1
- 3
7
votes
1 answer
How to run parallel threads in AWS Glue PySpark?
I have a spark job that will just pull data from multiple tables with the same transforms. Basically a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, then shoves into Redshift (example below).
This job…

sewardth
- 347
- 2
- 13
6
votes
6 answers
AWS Glue error - Invalid input provided while running python shell program
I have a Glue job that runs Python shell code. When I try to run it, I end up getting the error below.
Job Name : xxxxx Job Run Id : yyyyyy failed to execute with exception Internal service error : Invalid input provided
It is not specific to code, even if I…

Ludwig
- 782
- 1
- 8
- 24
6
votes
3 answers
How to stop / exit an AWS Glue Job (PySpark)?
I have a successfully running AWS Glue Job that transforms data for predictions. I would like to stop processing and output a status message (which is working) if I reach a specific condition:
if specific_condition is None:
…

calycolor
- 726
- 1
- 7
- 19
6
votes
1 answer
AWS Glue Python Job not creating new Data Catalog partitions
I created an AWS Glue Job using Glue Studio.
It takes data from a Glue Data Catalog, does some transformations, and writes to a different Data Catalog.
When configuring the target node, I enabled the option to create new partitions after…

gshpychka
- 8,523
- 1
- 11
- 31
6
votes
1 answer
'Log group does not exist' when AWS Glue fails
I'm using AWS Glue jobs for the very first time, so it is normal that my job does not work, but I can't see any detailed log about what is wrong, because when I click the "Error Logs" link or the "Logs" link I always get this message in AWS…

santos82h
- 452
- 5
- 15
5
votes
0 answers
AWS Glue job without script, Spark/Scala JAR only
Is there a way to run a Glue job in AWS where all the necessary code is built into a JAR artifact and uploaded to S3?
Right now the best I can do is a placeholder wrapper script like
import project.ActualMainClass
object ScriptMain…

wrschneider
- 17,913
- 16
- 96
- 176
5
votes
1 answer
What options can be passed to AWS Glue DynamicFrame.toDF()?
The documentation for the toDF() method specifies that we can pass an options parameter to this method. But it does not specify what those options can be…

AHonarmand
- 530
- 1
- 8
- 16
4
votes
0 answers
Facing issue with integrating code with Aws glue code, ray and pyspark
I am facing the following exception; I have tried various approaches but have not resolved it.
The exception occurs during parallel distributed processing with the Ray library. Exception: It appears that you are attempting to reference SparkContext from a broadcast…

VISHAL LIMGIRE
- 529
- 1
- 5
- 21
4
votes
2 answers
How to log messages in AWS Glue worker (inside map function)?
I am able to follow the instructions in https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html and log messages in the driver. But when I try to use the logger inside the map function like this
sc = SparkContext()
glueContext…

Xiqiang Lin
- 41
- 3
4
votes
2 answers
AWS Glue job fails with the error "Command failed with exit code 10"
I get this error message every now and then, which makes the job very unreliable.
On deeper investigation, with continuous logging enabled, I see the following error:
2021-09-02 10:38:19,810 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Unknown…

Ankit Goel
- 360
- 1
- 5
- 18
4
votes
2 answers
Pyspark SQL dataframe map with multiple data types
I have PySpark code in Glue where I want to create a DataFrame with a map structure whose values are a combination of integers and strings.
sample data:
{ "Candidates": [
{
"jobLevel": 6,
"name": "Steven",
}, {
"jobLevel": 5,
…

nithya j
- 43
- 4
4
votes
1 answer
How to prevent AWS Glue DynamicFrame from dropping empty columns when reading a CSV?
If I have a CSV with (simple case) the header and one row of data, where some of the values are missing (null), like this:
name,surname,age
John,,32
and the corresponding catalog definition is like this:
MyDataTable:
Type: AWS::Glue::Table
DependsOn:…

Randomize
- 8,651
- 18
- 78
- 133