
I have a huge csv file (approx. 5-6 GB in size) whose data is also in Hive. Is there a way to count the number of unique lines that exist in the file?

I do not have any clue how to approach this.

I need to compare the output with another Hive table which has similar content but only unique values. So, basically, I need to find the number of lines which are different.

Kaushik
  • Do you mean you want to compute the row count of the Hive table? If so, just execute the SQL "select count(1) from hive_table" – Shawn.X May 16 '19 at 08:13
  • I don't know what type of comparison you need, but the first obstacle I'd come across is the file size. You can read large files in chunks, which makes your line contents/strings comparable. See this question and its answers on how to process large files: https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python - as you don't know the chunk size, you can read and concatenate chunks until you find a newline character, which gives you a single line of one of your files. Do that with your output too, then compare line by line. HTH! – xph May 16 '19 at 08:22
  • @Shawn.X no, I don't want to compute the row count of the Hive table. The data exists in a Hive table where the values are redundant; the same data is also available in a csv file. So, I am looking for a workaround which will help me count the number of unique rows. – Kaushik May 16 '19 at 08:52
  • OK, you want to count the csv file locally, not with Hive. If you use a Unix system, you can just execute the command "cat file.txt | sort | uniq | wc -l". By the way, why don't you compute it in Hive? It's easy too. – Shawn.X May 16 '19 at 08:56
  • Honestly speaking, I am new to Hive and not sure about using Hive from Python. I am learning it nowadays. – Kaushik May 16 '19 at 09:00
  • @Shawn.X Can you provide me the command to select duplicate rows in hive? – Kaushik May 16 '19 at 12:58
  • It's easy using SQL on Hive tables: load it into a Hive table and then join and compare with the other table – vikrant rana May 16 '19 at 18:40
  • @Kaushik, OK, it's simple. "select count(distinct col_name) from your_hive_table." – Shawn.X May 17 '19 at 01:57
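
For reference, a minimal HiveQL sketch of what the comments above suggest; the table names (csv_table, clean_table) and columns (col1, col2) are placeholders, not names from the question:

-- Count of distinct rows in the table loaded from the csv (placeholder names)
SELECT COUNT(*) AS unique_rows
FROM (SELECT DISTINCT col1, col2 FROM csv_table) t;

-- Rows of csv_table that have no match in clean_table (the join-and-compare idea)
SELECT COUNT(*) AS differing_rows
FROM (
    SELECT DISTINCT a.col1, a.col2
    FROM csv_table a
    LEFT JOIN clean_table b
      ON a.col1 = b.col1 AND a.col2 = b.col2
    WHERE b.col1 IS NULL
) t;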

1 Answer


The logic below works based on hashing. It stores a hash of each line rather than the entire line, which minimises memory usage, and then compares the hashes. Equal strings always produce the same hash, but very rarely two different strings may share a hash, so for those cases the actual lines are read back and compared as strings to be sure. This should work for huge files too.

from collections import Counter

input_file = r'input_file.txt'

# Main idea:
# - If two lines have different hashes, their contents are different.
# - If two lines share a hash, their contents are probably equal but may differ
#   (a collision), so only those lines are re-read and compared as real strings.


def count_with_index(values):
    '''
    Returns a dict like {value: (count, [indexes])}.
    '''
    result = {}
    for i, v in enumerate(values):
        count, indexes = result.get(v, (0, []))
        indexes.append(i)
        result[v] = (count + 1, indexes)
    return result


def get_lines(fp, line_numbers):
    '''Yields only the lines of fp whose 0-based line numbers are in line_numbers.'''
    return (line for i, line in enumerate(fp) if i in line_numbers)


# Hash every line and remember where each hash occurred
with open(input_file) as fp:
    hash_counter = count_with_index(map(hash, fp))

# Every hash that occurs exactly once corresponds to one unique line
total_sum = sum(1 for count, _ in hash_counter.values() if count == 1)

# Hashes that occur more than once: either genuine duplicates or collisions
non_unique_hashes = ((h, indexes) for h, (count, indexes) in hash_counter.items() if count != 1)

# For every non-unique hash, read back the actual lines and count the distinct ones.
# Only one hash is handled at a time, so memory consumption stays low.
for h, indexes in non_unique_hashes:
    with open(input_file) as fp:
        distinct_lines = Counter(get_lines(fp, indexes))
    total_sum += len(distinct_lines)

print('Total number of unique lines is : ', total_sum)
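
The script above only counts the unique lines of one file. As a rough sketch (not part of the original answer) of how the same hashing idea could cover the comparison asked about in the question, assuming the second Hive table has been exported to a plain text file; the file names are placeholders and hash collisions are ignored here for brevity:

def line_hashes(path):
    # Collect the set of line hashes of a file without keeping the lines themselves
    with open(path) as fp:
        return {hash(line) for line in fp}

# Placeholder file names: the csv and an assumed export of the other Hive table
hashes_a = line_hashes('input_file.txt')
hashes_b = line_hashes('other_table_export.txt')

# Lines whose hash appears in only one of the two files are (very likely) different
print('Lines only in the first file :', len(hashes_a - hashes_b))
print('Lines only in the second file:', len(hashes_b - hashes_a))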

Praveenkumar
  • It is ignoring the lines which are duplicate, and that is fine. But shouldn't it at least count them once? Let's say a line occurs 5 times in the file; our script should count it one time. – Kaushik May 16 '19 at 08:48
  • Any type of record keeping or tracking could potentially increase the memory usage. That's why no record keeping is done. Since the requirement was to find the unique lines count, we gather it in `unique_lines_count`. – Praveenkumar May 16 '19 at 09:12
  • @Kaushik I have edited the code as per your need and created a new algorithm for it. Just replace the input file name with your huge file and try running it. Let me know your results. – Praveenkumar May 20 '19 at 06:27
  • @Kaushik Great to hear. Thanks for letting me know. Can you also accept this as answer? – Praveenkumar May 20 '19 at 07:17
  • @Kaushik, you have to accept like this, `To mark an answer as accepted, click on the check mark beside the answer to toggle it from greyed out to filled in`. – Praveenkumar May 20 '19 at 07:29