Questions tagged [aws-glue-spark]

244 questions
13
votes
1 answer

How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty folders that Spark generates "$folder$"…
Lina
  • 1,217
  • 1
  • 15
  • 28
8
votes
1 answer

AWS Glue worker pricing details for G.1X and G.2X

I have searched the AWS Glue documentation but could not find the pricing details for AWS Glue worker types G.1X and G.2X. Can someone please explain whether there is any cost difference between Standard, G.1X and G.2X? All I can see in the Glue pricing section is…
Yuva
  • 2,831
  • 7
  • 36
  • 60
7
votes
1 answer

'Can not create a Path from an empty string' Error for 'CREATE TABLE AS' in hive using S3 path

I am trying to create a table in the Glue catalog with an S3 path location from Spark running in EMR using Hive. I have tried the following commands, but I get the error: pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Can not…
7
votes
1 answer

How to run parallel threads in AWS Glue PySpark?

I have a spark job that will just pull data from multiple tables with the same transforms. Basically a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, then shoves into Redshift (example below). This job…
sewardth
  • 347
  • 2
  • 13
6
votes
6 answers

AWS Glue error - Invalid input provided while running python shell program

I have a Glue job that runs Python shell code. When I try to run it, I get the error below: Job Name : xxxxx Job Run Id : yyyyyy failed to execute with exception Internal service error : Invalid input provided. It is not specific to the code; even if I…
Ludwig
  • 782
  • 1
  • 8
  • 24
6
votes
3 answers

How to stop / exit an AWS Glue Job (PySpark)?

I have a successfully running AWS Glue Job that transforms data for predictions. I would like to stop processing and output a status message (which is working) if I reach a specific condition: if specific_condition is None: …
calycolor
  • 726
  • 1
  • 7
  • 19
6
votes
1 answer

AWS Glue Python Job not creating new Data Catalog partitions

I created an AWS Glue Job using Glue Studio. It takes data from a Glue Data Catalog, does some transformations, and writes to a different Data Catalog. When configuring the target node, I enabled the option to create new partitions after…
6
votes
1 answer

'Log group does not exist' when AWS Glue fails

I'm using AWS Glue jobs for the very first time, so it is normal that my job does not work, but I can't see any detailed log about what is wrong, because when I click the "Error Logs" link or the "Logs" link I always get this message in AWS…
santos82h
  • 452
  • 5
  • 15
5
votes
0 answers

AWS Glue job without script, Spark/Scala JAR only

Is there a way to run a Glue job in AWS where all the necessary code is built into a JAR artifact and uploaded to S3? Right now the best I can do is something like a placeholder wrapper script like import project.ActualMainClass object ScriptMain…
wrschneider
  • 17,913
  • 16
  • 96
  • 176
5
votes
1 answer

What options can be passed to AWS Glue DynamicFrame.toDF()?

The documentation on the toDF() method specifies that we can pass an options parameter to this method. But it does not specify what those options can be…
AHonarmand
  • 530
  • 1
  • 8
  • 16
4
votes
0 answers

Facing an issue integrating code with AWS Glue, Ray and PySpark

I am facing the following exception. I have tried various approaches but have not resolved it. The exception occurs in parallel distributed processing using the Ray library. Exception: It appears that you are attempting to reference SparkContext from a broadcast…
VISHAL LIMGIRE
  • 529
  • 1
  • 5
  • 21
4
votes
2 answers

How to log messages in AWS Glue worker (inside map function)?

I am able to follow the instructions in https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html and log messages in the driver. But when I try to use the logger inside the map function like this: sc = SparkContext() glueContext…
4
votes
2 answers

AWS Glue job fails with the error "Command failed with exit code 10"

I get this error message every now and then, which makes the job very unreliable. On deeper evaluation with continuous logging, I see the following error: 2021-09-02 10:38:19,810 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Unknown…
Ankit Goel
  • 360
  • 1
  • 5
  • 18
4
votes
2 answers

Pyspark SQL dataframe map with multiple data types

I have PySpark code in Glue where I want to create a DataFrame with a map structure that combines integer and string values. Sample data: { "Candidates": [ { "jobLevel": 6, "name": "Steven", }, { "jobLevel": 5, …
4
votes
1 answer

How to prevent AWS Glue DynamicFrame from dropping empty columns when reading a CSV?

If I have a CSV with (simple case) the header and one row of data, where some of the values are missing (null), like this: name,surname,age John,,32 and the corresponding catalog definition is like this: MyDataTable: Type: AWS::Glue::Table DependsOn:…
Randomize
  • 8,651
  • 18
  • 78
  • 133