
If I build a Hive table on top of some S3 (or HDFS) directory like so:

```sql
CREATE EXTERNAL TABLE newtable (name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://location/subdir/';
```

When I add files to that S3 location, the Hive table doesn't automatically update. The new data is only included if I create a new Hive table on that location. Is there a way to build a Hive table (maybe using partitions) so that whenever new files are added to the underlying directory, the Hive table automatically shows that data (without having to recreate the Hive table)?

covfefe
  • Were the files added directly to `s3a://location/subdir/` or to any subdirectories under this location? – franklinsijo Mar 08 '17 at 16:26
  • This does not make sense. The metastore holds the location, not its content. Every file within the location is supposed to be scanned when you query the table. – David דודו Markovitz Mar 08 '17 at 16:31
  • @franklinsijo The files were added directly to `s3a://location/subdir/`. @Dudu Every file is supposed to be scanned, which is why, if I add another file to that subdirectory, I would expect that data to show up when I run `select *` on the table. But it doesn't; it shows the same table (without the newly added data). – covfefe Mar 08 '17 at 18:23

2 Answers


On HDFS, each file in the table's location is scanned each time the table is queried, as @Dudu Markovitz pointed out. And files in HDFS are immediately consistent, so a newly added file shows up in the next query without any DDL.

Update: S3 is now also strongly consistent, so I removed the part about eventual consistency.
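A quick way to check this from the Hive CLI (a sketch; `newtable` and the paths come from the question, `/tmp/more_names.csv` is a hypothetical local file, and the `dfs` command assumes the CLI has the `s3a` filesystem configured):

```sql
-- Query the table, then drop a new file into its location and query again.
SELECT COUNT(*) FROM newtable;

-- Add a file directly to the table's directory (the Hive CLI can run dfs commands):
dfs -copyFromLocal /tmp/more_names.csv s3a://location/subdir/;

-- No DDL is needed: the SELECT re-lists the table's location,
-- so the new rows appear without recreating or repairing the table.
SELECT COUNT(*) FROM newtable;
```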

Also, there may be a problem with Hive answering queries from stale table statistics after adding files; see here: https://stackoverflow.com/a/39914232/2700344
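If you hit that statistics problem (aggregate queries like `count(*)` being answered from stale stats instead of the files), a hedged sketch of the usual workarounds:

```sql
-- Force Hive to scan the data instead of answering COUNT(*)/MIN()/MAX()
-- from possibly stale table statistics:
SET hive.compute.query.using.stats=false;

-- Or refresh the statistics after adding files:
ANALYZE TABLE newtable COMPUTE STATISTICS;
```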

leftjoin

Everything @leftjoin says is correct, with one extra detail: S3 doesn't offer immediate consistency on listings. A new blob can be uploaded, and HEAD/GET requests will return it, but a LIST operation on the parent path may not see it yet. This means that Hive code which lists the directory may not see the new data. Using unique filenames doesn't fix this; the only fix is keeping the listing in a consistent database like DynamoDB, updated as files are added and removed. And even there, you have added a new thing to keep in sync...
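For reference, this is roughly what Hadoop's S3Guard did before S3 became strongly consistent; a sketch of the relevant `core-site.xml` configuration, assuming an S3Guard-era Hadoop release (the feature was later removed):

```xml
<!-- core-site.xml: keep S3 directory listings in DynamoDB (S3Guard),
     so that newly written files are visible to list operations. -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
```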

stevel
  • 12,567
  • 1
  • 39
  • 50