
I have a bunch of parquet data in a structure something like col1=1/col2=2/col3=3/part-00000-33b48309-0442-4e86-870f-f3070268107f-c000.snappy.parquet

I've read up on what I could find, and it seems fairly clear what each part of the file name means: part-00000 increments per file in the partition, c000 has something to do with another part of the output configuration, and the rest is a UUID that prevents collisions during parallel writes.

I'm wondering - what parts of the filename can I change, or get rid of? Specifically, is it safe to just remove the UUID?

(The larger motivation: I need to add data over time to an existing store while maintaining N files per partition, and since you can't overwrite the files you're reading, I need to stage the new files and then copy them over, which would be easier with known file names.)
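
Roughly the stage-then-swap workflow I have in mind; the paths and numFiles below are just placeholders:

// merge the existing store with the newly arrived data
val merged = spark.read.parquet("/store/current", "/store/incoming")

// write the merged result to a staging location with a known file count...
val numFiles = 4
merged
  .coalesce(numFiles)
  .write
  .mode("overwrite")
  .parquet("/store/staging")

// ...then move /store/staging over /store/current outside of Spark
// (hadoop fs -mv, or whatever the storage layer provides)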

Narfanator

1 Answer


Maybe you can apply the solution from Spark parquet partitioning : Large number of files:

import org.apache.spark.sql.functions.floor
import spark.implicits._  // for the $"col" syntax

data
  .repartition($"key", floor($"row_number" / N) * N)  // group each run of N rows per key into its own file
  .write.partitionBy("key")
  .parquet("/location")
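
Note that `row_number` is not a built-in column; one illustrative way to derive it (the source dataframe `raw` and the ordering column `id` are placeholders) would be something like:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// assign a running row number within each key before repartitioning
val data = raw.withColumn(
  "row_number",
  row_number().over(Window.partitionBy($"key").orderBy($"id"))
)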
pedvaljim
  • That doesn't do what I need. I have `A` and `B`, and I need to merge them together and then replace `A`. – Narfanator Dec 04 '18 at 19:30
  • You can load A and B as different dataframes, join them, and finally write to A's location using coalesce. – pedvaljim Dec 05 '18 at 09:50
  • I've tried that. `spark.read.parquet(a, b).coalesce(1).write.parquet(a)`, right? I'll try it again, but I remember getting errors. – Narfanator Dec 05 '18 at 19:58
  • Yeah, can't do things in-place. Have to have a third location to use as a swap. – Narfanator Jan 21 '19 at 22:21
  • Makes sense, as Spark won't actually start executing the read() until an action is performed (in this case write()), and I don't think it will want to overwrite the same location it is reading from at the same time. – pedvaljim Jan 22 '19 at 14:32