Questions tagged [spark-redshift]

28 questions
2
votes
1 answer

How to make an existing column NOT NULL in AWS Redshift?

I dynamically created a table through a Glue job and it is working fine. But per a new requirement, I need to add a new column that generates unique values and should be the primary key in Redshift. I implemented the same using…
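Since Redshift cannot alter an existing column to NOT NULL in place, one plausible workaround is to rebuild the table with an IDENTITY primary-key column. A minimal sketch in Python with psycopg2; the endpoint, schema, and column names are hypothetical:

```python
import psycopg2

# Hypothetical connection details.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...",
)
with conn, conn.cursor() as cur:
    # Redshift can't ALTER a column to NOT NULL, so build a new table
    # with an auto-generated, non-null identity key and swap it in.
    cur.execute("""
        CREATE TABLE my_schema.my_table_new (
            id BIGINT IDENTITY(1, 1) NOT NULL,
            col_a VARCHAR(256),
            col_b INT,
            PRIMARY KEY (id)
        )
    """)
    cur.execute("""
        INSERT INTO my_schema.my_table_new (col_a, col_b)
        SELECT col_a, col_b FROM my_schema.my_table
    """)
    cur.execute("ALTER TABLE my_schema.my_table RENAME TO my_table_old")
    cur.execute("ALTER TABLE my_schema.my_table_new RENAME TO my_table")
```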
2
votes
4 answers

java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.&lt;init&gt;(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V

I am trying to read Redshift table data into a Spark DataFrame and write that DataFrame to another Redshift table. I am using the following .jar files in spark-submit for this task. Here is the command: spark-submit --jars…
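This NoSuchMethodError usually means the aws-java-sdk on the classpath does not match the version the S3 transfer code was built against. A sketch of pinning mutually compatible artifacts from PySpark; the version numbers here are illustrative assumptions, not a verified matrix:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("redshift-copy")
    # Versions must agree with each other; these are assumptions.
    .config(
        "spark.jars.packages",
        ",".join([
            "com.databricks:spark-redshift_2.11:3.0.0-preview1",
            "org.apache.hadoop:hadoop-aws:2.7.3",
            "com.amazonaws:aws-java-sdk:1.7.4",  # the SDK hadoop-aws 2.7.x expects
        ]),
    )
    .getOrCreate()
)
```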
2
votes
0 answers

Connecting SparkR with Redshift: Failed to find data source: com.databricks.spark.redshift

I have a Spark cluster set up with Amazon EMR and RStudio installed on top of it. I am trying to connect SparkR to Redshift through the package spark-redshift_2.11-0.5.0.jar, during which I am facing the error Failed to find data source:…
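"Failed to find data source" generally means the connector class never made it onto the driver/executor classpath. A PySpark sketch of the fix (the same --jars/--packages flags apply to a SparkR session launched through spark-submit); the jar path and connection values are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # The connector jar, plus its dependencies (spark-avro, the Redshift
    # JDBC driver), must all be on the classpath.
    .config("spark.jars", "/opt/jars/spark-redshift_2.11-0.5.0.jar")
    .getOrCreate()
)

df = (
    spark.read.format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/dev?user=admin&password=...")
    .option("dbtable", "my_table")
    .option("tempdir", "s3n://my-bucket/tmp")  # staging area the connector needs
    .load()
)
```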
2
votes
1 answer

How to write a PySpark DataFrame to Redshift?

I am trying to write a PySpark DataFrame to Redshift but it results in an error: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated Caused…
murtaza1983
  • 247
  • 2
  • 8
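The AvroFileFormat instantiation error typically points at a spark-avro build that does not match the running Spark version. A hedged sketch of a write with the versions aligned; the artifact versions, URL, and table names are assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "com.databricks:spark-redshift_2.11:3.0.0-preview1,"
        "com.databricks:spark-avro_2.11:4.0.0",  # must match the Spark version
    )
    .getOrCreate()
)
df = spark.createDataFrame([(1, "a")], ["id", "val"])

(
    df.write.format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/dev?user=admin&password=...")
    .option("dbtable", "target_table")
    .option("tempdir", "s3a://my-bucket/tmp")
    .option("forward_spark_s3_credentials", "true")
    .mode("append")
    .save()
)
```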
1
vote
1 answer

Check whether a Spark format exists or not

Context: the Spark reader has the function format, which is used to specify a data source type, for example JSON, CSV, or a third party such as com.databricks.spark.redshift. Help: how can I check whether a third-party format exists or not? Let me give a case. In…
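One best-effort approach is to ask Spark's own resolver whether the format name maps to a class. This goes through an internal API whose signature varies across Spark versions, so treat it strictly as a sketch:

```python
from pyspark.sql import SparkSession
from py4j.protocol import Py4JJavaError

spark = SparkSession.builder.getOrCreate()

def format_exists(spark, fmt):
    """Return True if Spark can resolve `fmt` to a data source class.
    Uses an internal resolver (Spark 2.3+ signature shown)."""
    try:
        spark._jvm.org.apache.spark.sql.execution.datasources.DataSource \
            .lookupDataSource(fmt, spark._jsparkSession.sessionState().conf())
        return True
    except Py4JJavaError:
        return False

print(format_exists(spark, "csv"))                            # True
print(format_exists(spark, "com.databricks.spark.redshift"))  # True only if the jar is on the classpath
```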
1
vote
1 answer

Error writing a DataFrame to Redshift using PySpark with boolean columns

In my script, PySpark's write method takes a DataFrame and writes it to Redshift; however, some DataFrames have boolean columns that produce an error stating that Redshift does not accept the bit data type. My question is, since it says that…
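The usual cause: Spark's plain JDBC writer maps BooleanType to BIT(1), which Redshift rejects. One workaround is to override the generated DDL per column with the createTableColumnTypes option (available since Spark 2.2); the names and URL below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, True), (2, False)], ["id", "is_active"])

(
    df.write.format("jdbc")
    .option("url", "jdbc:redshift://host:5439/dev")
    .option("dbtable", "my_schema.flags")
    .option("user", "admin")
    .option("password", "...")
    # Force BOOLEAN instead of the default BIT(1) mapping.
    .option("createTableColumnTypes", "is_active BOOLEAN")
    .mode("overwrite")
    .save()
)
```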
1
vote
0 answers

400 : Bad Request, py4j.protocol.Py4JJavaError: An error occurred while calling o44.save

After some research I am able to connect to Redshift using PySpark and can read table data into a Spark DataFrame. Now I am trying to insert that DataFrame into another Redshift table (with the same structure). Here is the code I am using to connect to…
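A 400 Bad Request while the connector stages data in S3 is often a region/signature mismatch (V4-only regions). A sketch of one common fix; the region and endpoint are assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # V4-only S3 regions require signature version 4 on driver and executors.
    .config("spark.driver.extraJavaOptions",
            "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.executor.extraJavaOptions",
            "-Dcom.amazonaws.services.s3.enableV4=true")
    .getOrCreate()
)
# Point s3a at the bucket's regional endpoint (region is hypothetical).
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.endpoint", "s3.us-east-2.amazonaws.com"
)
```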
1
vote
1 answer

Issue while connecting Spark to Redshift using the spark-redshift connector

I need to connect Spark to my Redshift instance to generate data. I am using Spark 1.6 with Scala 2.10, and have used a compatible JDBC connector and spark-redshift connector. But I am facing a weird problem, that is: I am using…
Aldrin Machado
  • 97
  • 1
  • 10
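For reference, a minimal read through the connector on Spark 1.6 goes through SQLContext rather than SparkSession. The URL, table, and tempdir below are hypothetical:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = (
    sqlContext.read.format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/dev?user=admin&password=...")
    .option("dbtable", "my_table")
    .option("tempdir", "s3n://my-bucket/tmp")  # the connector unloads via S3
    .load()
)
df.show()
```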
1
vote
0 answers

PySpark issue with timestamp casts when reading a MySQL DB

Python 2.7, PySpark 2.2.1, JDBC format for MySQL -> Spark DF. For writing Spark DF -> AWS Redshift I am using the `spark-redshift` driver from Databricks. I am reading data into Spark from MySQL tables for my application; due to the context and depending…
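A common mitigation is to pin one canonical time zone end to end so timestamps are not shifted between MySQL, Spark, and Redshift. A sketch below; the JDBC URL flags are MySQL Connector/J options and the table name is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Available from Spark 2.2: make Spark interpret timestamps in UTC.
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = (
    spark.read.format("jdbc")
    .option("url",
            "jdbc:mysql://host:3306/db"
            "?useLegacyDatetimeCode=false&serverTimezone=UTC")
    .option("dbtable", "events")
    .option("user", "reader")
    .option("password", "...")
    .load()
)
```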
1
vote
2 answers

Unable to connect to S3 using spark-redshift library in java

I am trying to create a table in Redshift based on a Spark dataset, using the spark-redshift driver over JDBC to achieve this locally. The code snippet to do this: data.write() .format("com.databricks.spark.redshift") .option("url",…
Ritika
  • 73
  • 1
  • 9
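The connector stages data in the tempdir through Hadoop's S3 filesystem, so S3 credentials must be set on the Hadoop configuration as well, not just in the Redshift URL. A PySpark sketch (the same hadoopConfiguration() calls work from Java); the key values are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "AKIA...")   # placeholder
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "...")   # placeholder
# Alternatively, let the connector forward these credentials to COPY:
#   .option("forward_spark_s3_credentials", "true")
```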
0
votes
0 answers

Spark Error: Could not initialize class org.apache.spark.rdd.RDDOperationScope$

I'm trying to print rows from my Spark DataFrame in Amazon SageMaker. I created the Spark DataFrame by reading a table from a Redshift database. Printing the full table alone shows the column names and types; however, trying to show the actual…
0
votes
0 answers

Databricks format in Pyspark to write in Redshift

I am migrating data from Postgres to Redshift using the JDBC format, but with Redshift some JDBC options, like escape, are not available. So I thought to use the format com.databricks.spark.redshift to write using PySpark.…
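The Databricks connector loads through S3 + COPY and lets you append raw COPY clauses via its extracopyoptions setting, which covers cases the plain JDBC writer cannot. A sketch; the URL, table, bucket, and chosen COPY options are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a|b")], ["id", "payload"])

(
    df.write.format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/dev?user=admin&password=...")
    .option("dbtable", "target_table")
    .option("tempdir", "s3a://my-bucket/tmp")
    .option("forward_spark_s3_credentials", "true")
    # Extra clauses appended verbatim to Redshift's COPY statement.
    .option("extracopyoptions", "TRUNCATECOLUMNS BLANKSASNULL")
    .mode("append")
    .save()
)
```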
0
votes
0 answers

Writing data to Redshift using JDBC

I am trying to write a DataFrame to a Redshift table with the following code using a JDBC connection. It is running very slowly (more than 20 hours to process). The DataFrame has 100 partitions. Can you suggest how we can improve the performance of writing the df…
Bab
  • 177
  • 2
  • 6
  • 17
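Plain JDBC INSERTs into Redshift are row-oriented and slow; two levers that often help are batching the inserts and reducing concurrent connections, though the S3 + COPY path of the Databricks connector is usually far faster for bulk loads. Values below are illustrative, not tuned:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).withColumnRenamed("id", "row_id")

(
    df.coalesce(8)                 # fewer, larger partitions => fewer connections
    .write.format("jdbc")
    .option("url", "jdbc:redshift://host:5439/dev")
    .option("dbtable", "my_schema.big_table")
    .option("user", "admin")
    .option("password", "...")
    .option("batchsize", 10000)    # rows per JDBC batch insert
    .mode("append")
    .save()
)
```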
0
votes
1 answer

How to optimize Redshift table for simple DELETE or SELECT queries?

I have DELETE queries in Redshift that take up to 40 seconds in production. The queries are created programmatically and look like EXPLAIN DELETE FROM platform.myTable WHERE id IN…
jn5047
  • 101
  • 1
  • 7
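DELETE ... WHERE id IN scans according to the table's sort order, so making the filtered column the sort key (then vacuuming and analyzing) is the usual first lever. A psycopg2 sketch that rebuilds the table; identifiers and connection details are hypothetical:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="prod", user="admin", password="...",
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()
# Rebuild with the DELETE predicate column as dist/sort key so scans
# can skip blocks.
cur.execute("""
    CREATE TABLE platform.myTable_sorted
    DISTKEY(id) SORTKEY(id)
    AS SELECT * FROM platform.myTable
""")
cur.execute("VACUUM platform.myTable_sorted")
cur.execute("ANALYZE platform.myTable_sorted")
```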
0
votes
0 answers

Is there any way I can retain spaces in Redshift while writing from AWS Glue?

I am trying to store spaces in a varchar column in Redshift. My data comes in CSV format and looks like this: "id","first_name","last_name","doj","address" "A1111","B1111","C1111","D111","E111" "A2222","B22222",""," ","E22" "A3333"," …
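If the CSV is read with Spark's own reader (rather than a Glue DynamicFrame), whitespace trimming can be switched off explicitly; both options below are standard Spark CSV options, and the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (
    spark.read
    .option("header", True)
    .option("quote", '"')
    # Keep leading/trailing spaces inside quoted fields intact.
    .option("ignoreLeadingWhiteSpace", False)
    .option("ignoreTrailingWhiteSpace", False)
    .csv("s3://my-bucket/input/people.csv")
)
```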