
So I have this very big text file, which is supposed to contain 64 million passwords. (https://crackstation.net/buy-crackstation-wordlist-password-cracking-dictionary.htm <- Smaller Wordlist (Human Passwords Only)) I can't open it with notepad++ or any other editor, even though I have 32GB of RAM.

I tried to read it all at once, remove the duplicates, then store the result in a file:

import os

IN_FILE = "./realhuman_phill.txt"
base, ext = os.path.splitext(IN_FILE)
outfile = base + "_no_duplicate" + ext
print "reading " + IN_FILE
# read the whole file into memory at once and split it into a list of lines
all_words = open(IN_FILE).read().splitlines()
print "{} elements in file".format(len(all_words))
print "removing duplicates"
# a set keeps only one copy of each line
myset = set()
myset.update(all_words)
print "{} elements remaining after duplicate removal".format(len(myset))

print "writing data"
with open(outfile, 'w') as f:
    for line in myset:
        f.write("%s\n" % line)

but then I end up with a ~200MB file (more than 600MB before) with only 19991889 lines (19.9 million). That many duplicates? Weird.

So I made this script to count the number of lines; according to Lazy Method for Reading Big File in Python? it should only load the file into RAM one line at a time:

abs_filename = r"D:\realhuman_phill.txt"
print "counting lines in {}".format(abs_filename)
with open(abs_filename) as infile:
    counter = 0
    for line in infile:
        counter = counter + 1 
print counter

and it returns 19991889 = 19,991,889, the same number, far from 64 million, and this time with no duplicate removal at all.

I'm guessing either Python or my OS does not let me access the rest of the file. Any idea what is going on?
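
One check I haven't actually run yet (just a sketch) would be to compare the file size on disk with how much data a read() actually returns, in both text and binary mode:

import os

abs_filename = r"D:\realhuman_phill.txt"

size_on_disk = os.path.getsize(abs_filename)   # size in bytes, straight from the OS
with open(abs_filename, 'rb') as infile:       # binary mode: no translation by Windows
    bytes_binary = len(infile.read())
with open(abs_filename, 'r') as infile:        # text mode, like the scripts above
    bytes_text = len(infile.read())

print "size on disk      : {}".format(size_on_disk)
print "read in binary    : {}".format(bytes_binary)
print "read in text mode : {}".format(bytes_text)

If the text-mode number comes out much smaller than the other two, the text-mode read is stopping (or translating) before the end of the file.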

Thanks

PS: Windows 8.1 64-bit, Python 2.7 64-bit

sliders_alpha
  • can you post a sample of the file _in the question_? – Jean-François Fabre Dec 05 '17 at 13:18
  • Aren't you running your (rather inefficient) line count check on the input file and not the output file? eg: aren't you supposed to be running it on the filename that results from `outfile = base + "_no_duplicate" + ext` ? – Jon Clements Dec 05 '17 at 13:29
  • @Jean-FrançoisFabre I can give you a sample of my 200MB output file, but not of the original file since I cannot open it – sliders_alpha Dec 05 '17 at 13:34
  • @JonClements no, I want to count the lines of the input file; those 2 scripts are independent and work on the original file – sliders_alpha Dec 05 '17 at 13:35
  • Sorry - I'm confused. Your input file contains the duplicates of which you're then deduplicating and creating a new file without duplicates, then you're comparing the line count of the input file to prove no duplicates have been removed from the output file? – Jon Clements Dec 05 '17 at 13:36
  • @JonClements the file is said to be "64 million lines". First I decided to remove the duplicates (if there were any) and I ended up with a 19 million line file. I found that weird, so I then counted the lines of the original/64M-line file and also ended up with a count of 19 million. But that can't be right, because the 'no duplicate' file is 200MB while the original file is 600MB, so it has to be more than 19 million – sliders_alpha Dec 05 '17 at 13:39

1 Answer


The issue could be with the line endings. Try forcing the file to be read in binary mode:

with open(abs_filename, 'rb') as infile:
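
In context, the counting script from the question would become something like this (a sketch, same logic, just opened with 'rb'):

abs_filename = r"D:\realhuman_phill.txt"
print "counting lines in {}".format(abs_filename)
with open(abs_filename, 'rb') as infile:  # 'rb' disables Windows text-mode translation
    counter = 0
    for line in infile:
        counter = counter + 1
print counter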
Siva Kovvuru
  • I really don't see how it fixed the issue. Unless there _are_ duplicates but with different line endings, and opening as binary makes the lines non-duplicates. – Jean-François Fabre Dec 05 '17 at 19:47