So I have this very big text file, which is supposed to contain 64 million passwords (https://crackstation.net/buy-crackstation-wordlist-password-cracking-dictionary.htm <- Smaller Wordlist (Human Passwords Only)). I can't open it with Notepad++ or any other editor, even though I have 32 GB of RAM.
I tried reading it all at once, removing the duplicates, and storing the result in a new file:
import os

IN_FILE = "./realhuman_phill.txt"
base, ext = os.path.splitext(IN_FILE)
outfile = base + "_no_duplicate" + ext

print "reading " + IN_FILE
# read the whole file into memory and split it into lines
all_words = open(IN_FILE).read().splitlines()
print "{} elements in file".format(len(all_words))

print "removing duplicates"
myset = set()
myset.update(all_words)
print "{} elements remaining after duplicate removal".format(len(myset))

print "writing data"
with open(outfile, 'w') as f:
    for line in myset:
        f.write("%s\n" % line)
But then I end up with a ~200 MB file (it was more than 600 MB before) containing only 19,991,889 lines (~19.9 million). So many duplicates? That seems weird.
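To see whether the whole file is actually being read, I could compare the size reported by the OS with the number of bytes the text-mode read actually yields. A rough, untested sketch (same path as above; text mode translates "\r\n" into "\n" on Windows, so the second number would come out slightly lower even if everything is read, but a gap of hundreds of MB would still show up):

import os

IN_FILE = "./realhuman_phill.txt"

# size on disk, in bytes, as reported by the OS
print "size on disk: {} bytes".format(os.path.getsize(IN_FILE))

# bytes actually yielded when iterating the file in text mode
bytes_read = 0
with open(IN_FILE) as infile:
    for line in infile:
        bytes_read += len(line)
print "bytes read in text mode: {}".format(bytes_read)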
I also made this script to count the number of lines; according to Lazy Method for Reading Big File in Python? it should only load one line of the file into RAM at a time:
abs_filename = r"D:\realhuman_phill.txt"
print "counting lines in {}".format(abs_filename)

# iterating over the file object should keep only one line in memory at a time
with open(abs_filename) as infile:
    counter = 0
    for line in infile:
        counter = counter + 1

print counter
It returns 19,991,889, the same number, far from 64 million, and this time no duplicate removal is involved.
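One thing I could still try, in case Windows text mode is doing some translation behind my back, is counting in binary mode instead (untested sketch, same path as above):

abs_filename = r"D:\realhuman_phill.txt"

# 'rb' disables any newline/EOF translation on Windows;
# iterating still splits on "\n" bytes
with open(abs_filename, 'rb') as infile:
    counter = 0
    for line in infile:
        counter = counter + 1
print counter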
I'm guessing either Python or my OS won't let me access the rest of the file. Any idea what is going on?
Thanks
PS: Windows 8.1 64-bit, Python 2.7 64-bit