Let's say I created a Hive table in ORC format and inserted 1M records into it, which produced a file with 17 stripes. The last stripe is not full.

If I then insert another 100 records into this table, will the new records be appended to the last stripe, or will a new stripe be created?

I tested this on an HDFS cluster, and it seems that every time we insert new records, a new file is created (and, of course, new stripes too). Why is that?

Nate

1 Answer

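The reason is that HDFS does not support editing a file in place. An ORC file also ends with a footer that indexes its stripes, so rows cannot simply be added to the last stripe of a finished file.

So every time we insert data into a Hive table, new files are created.

If you want to merge these files, you can use CONCATENATE: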

ALTER TABLE <table_name> CONCATENATE;

(or)

You can run an INSERT OVERWRITE on the same table that you select from, which merges all the small files into bigger files:

INSERT OVERWRITE TABLE <db_name>.<table1> SELECT * FROM <db_name>.<table1>;

You can also use SORT BY and DISTRIBUTE BY to control the number of files created in the HDFS directory.
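
As a rough sketch of the DISTRIBUTE BY approach, something like the following could work. It assumes the query runs with a reduce stage, in which case each reducer typically writes one output file. `<column>` is a placeholder for whichever column you distribute on, and the reducer count of 4 is only an illustrative value, not a recommendation.

```sql
-- Sketch only: <db_name>, <table1>, and <column> are placeholders.
-- Assumption: each reducer writes one file, so this aims for about 4 files.
SET mapreduce.job.reduces=4;

-- DISTRIBUTE BY sends rows with the same <column> value to the same reducer;
-- SORT BY orders the rows within each reducer's output.
INSERT OVERWRITE TABLE <db_name>.<table1>
SELECT *
FROM <db_name>.<table1>
DISTRIBUTE BY <column>
SORT BY <column>;
```

Listing the table directory before and after (for example with `hdfs dfs -ls`) shows whether the file count changed as expected.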

notNull