
I have a folder in HDFS, let's call it /data/users/. Inside that folder, a new CSV file is added every 10 days. Each new file contains only the currently active users, so, for example:

  • file_01Jan2020.csv: contains data for 1,000 users who are currently active
  • file_10Jan2020.csv: contains data for 950 users who are currently active (the same data as file_01Jan2020.csv, with 50 fewer records)
  • file_20Jan2020.csv: contains data for 920 users who are currently active (the same data as file_10Jan2020.csv, with 30 fewer records)

In reality, these files are much bigger (~8 million records per file, decreasing by maybe 1K every 10 days). Also, a newer file will never contain records that don't exist in the older files; it will just have fewer records.

I want to create a table in Hive using the data in this folder. What I am doing now is:

  • Create an external table from the data in the folder /data/users/
  • Create an internal table with the same structure
  • Write the data from the external table to the internal table, where
    • duplicates are removed, and
    • if a record doesn't exist in the latest file, I mark it as deleted by setting a 'deleted' flag in an extra column I defined in the internal table (a simplified sketch of these steps follows below)
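Roughly, those steps look like this. This is only a simplified sketch: users_ext, users_managed, user_id and name are placeholder names for my real schema, and the newest file name is hardcoded here just to illustrate how the deleted flag is derived (via Hive's INPUT__FILE__NAME virtual column).

    -- external table over the raw folder
    CREATE EXTERNAL TABLE users_ext (
      user_id BIGINT,
      name    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/users/';

    -- internal (managed) table with the extra flag column
    CREATE TABLE users_managed (
      user_id BIGINT,
      name    STRING,
      deleted STRING   -- 'Y' once the user stops appearing in the newest file
    )
    STORED AS ORC;

    -- dedup across all files and flag users missing from the newest file
    INSERT OVERWRITE TABLE users_managed
    SELECT
      user_id,
      max(name) AS name,
      CASE WHEN max(in_latest) = 1 THEN 'N' ELSE 'Y' END AS deleted
    FROM (
      SELECT
        user_id,
        name,
        CASE WHEN INPUT__FILE__NAME LIKE '%file_20Jan2020.csv' THEN 1 ELSE 0 END AS in_latest
      FROM users_ext
    ) t
    GROUP BY user_id;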

I am concerned about the step where I create the external table: since the data are really big, that table will be huge after some time, and I was wondering if there is a more efficient way of doing this instead of loading all the files in the folder every time.

So my question is: what is the best way to ingest data from an HDFS folder into a Hive table, given that the folder contains lots of files with lots of duplication?

Sarah
  • "If a record doesn't exist in one of the files, then I'll mark it as 'deleted' and set the 'deleted' in a new column that I defined in the internal table I created" - If you are loading records from these files only, how record can be not in files ? And do you really need to check all files every time? – leftjoin Jan 22 '21 at 11:23
  • It is not clear how you process deletes; please describe it in detail, as I do not understand how a record can disappear from a previous file. If it is based on existence in the latest file, then all you need is to put the new file in a separate folder (per load date) and merge it with the target table using a Hive query with full join + case logic: https://stackoverflow.com/a/37744071/2700344 – leftjoin Jan 22 '21 at 15:41
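A rough sketch of the full join + case merge leftjoin describes, assuming only the newest file is loaded into a separate staging table (users_staging here; all names are placeholders) and user_id is the key. Hive allows overwriting a table that also appears in the FROM clause, which is what the merge relies on.

    INSERT OVERWRITE TABLE users_managed
    SELECT
      COALESCE(s.user_id, t.user_id) AS user_id,
      COALESCE(s.name,    t.name)    AS name,
      -- a user missing from the newest file is flagged as deleted
      CASE WHEN s.user_id IS NULL THEN 'Y' ELSE 'N' END AS deleted
    FROM users_managed t
    FULL OUTER JOIN users_staging s
      ON t.user_id = s.user_id;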

1 Answer


I'd suggest partitioning the data by date; that way you don't have to go through all the records every time you read the table.
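For example, a minimal sketch, assuming each new file is dropped into its own load_date=... subfolder (the table and column names are placeholders):

    -- one subdirectory per drop, e.g. /data/users/load_date=2020-01-20/
    CREATE EXTERNAL TABLE users_ext (
      user_id BIGINT,
      name    STRING
    )
    PARTITIONED BY (load_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/users/';

    -- register the new partition whenever a file arrives
    ALTER TABLE users_ext ADD PARTITION (load_date='2020-01-20');

    -- partition pruning: only the newest folder is scanned
    SELECT * FROM users_ext WHERE load_date = '2020-01-20';

With the partition column in the WHERE clause, Hive reads only that subfolder rather than every file under /data/users/.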

Amardeep Flora