I have a huge text file containing duplicate lines, roughly 150,000,000 lines in total. I'd like to find the most efficient way to read these lines in and eliminate the duplicates. The approaches I'm considering are:
- Read the whole file in, do a list(set(lines)).
- Read 10k lines in at a time, do a list(set(lines)) on what I have, read another 10k lines into the list, do a list(set(lines)) again. Repeat (see the sketch below).
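For concreteness, the second approach would look roughly like this sketch (input.txt and the 10k chunk size are just placeholders, and I realize the repeated list(set(...)) throws away the original line order):

```python
from itertools import islice

# Rough sketch of approach 2: read in 10k-line chunks and dedupe as we go.
# "input.txt" and the chunk size are placeholders.
lines = []
with open("input.txt") as src:
    while True:
        chunk = list(islice(src, 10_000))   # next 10k lines, empty at EOF
        if not chunk:
            break
        lines.extend(chunk)
        lines = list(set(lines))            # dedupe everything seen so far
```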
How would you approach this problem? Would any form of multiprocessing help?