
I'm using the spark_write_table function from sparklyr to write tables into HDFS, using the partition_by argument to define how they are partitioned:

 R> my_table %>% 
       spark_write_table(
             name="mytable",
             mode="append",
             partition_by=c("col1", "col2")
        )

However, now I want to update the table by rewriting just one partition, instead of writing the whole table again.

In Hive SQL I would do something like:

INSERT INTO TABLE mytable
PARTITION (col1 = 'my_partition')
VALUES (myvalues..)

Is there an equivalent way to do this in sparklyr? I cannot find it in the documentation.

Re: duplication note. This question is specifically about how to do this in R with the sparklyr function, while the other question is about general Hive syntax.

dalloliogm
  • I'm not sure this is even available in [tag:sparkr], so probably not in [tag:sparklyr] since it's a wrapper. – eliasah Jan 26 '18 at 10:15
  • The only solution (generally with Spark) might be to write into a partitioned directory. – eliasah Jan 26 '18 at 10:17
  • @eliasah thanks, but I think there is no option to do it with sparklyr; I will have to use SQL directly: save the partition to a temporary table, then call INSERT INTO TABLE mytable PARTITION. – dalloliogm Jan 26 '18 at 11:40
  • I don't think that it will work with the metastore either, like you said @user8371915 – eliasah Jan 26 '18 at 12:42
  • Sorry, but I don't think this question is a duplicate, because I was explicitly asking how to do this in R with sparklyr. The other question linked is about how to do it in Spark, but I am asking about doing it with a tidyverse verb. – dalloliogm Oct 18 '18 at 10:19

1 Answer


Thanks all for the comments.

It seems there is no way to do this with sparklyr directly, so this is what I am going to do.

In short, I'll save the new partition to a temporary table, use a Hive SQL command to drop the stale partition, and then use another SQL command to insert the contents of the temporary table into it.

> dbGetQuery(con, 
   "ALTER TABLE mytable DROP IF EXISTS PARTITION (mycol='partition1');")

> spark_write_table(new_partition, "tmp_partition_table")

> dbGetQuery(con, 
      "INSERT INTO TABLE mytable
       PARTITION (mycol='partition1') 
       SELECT * 
       FROM tmp_partition_table"
   )
dalloliogm