
So I have this very big text file, which is supposed to contain 64 million passwords. (https://crackstation.net/buy-crackstation-wordlist-password-cracking-dictionary.htm <- Smaller Wordlist (Human Passwords Only)) I can't open it with notepad++ or any other editor, even though I have 32GB of RAM.

I tried to read it all at once, remove the duplicates, then store the result in a file:

import os

IN_FILE = "./realhuman_phill.txt"
base, ext = os.path.splitext(IN_FILE)
outfile = base + "_no_duplicate" + ext
print "reading " + IN_FILE
# read the whole file into memory at once and split it into a list of lines
all_words = open(IN_FILE).read().splitlines()
print "{} elements in file".format(len(all_words))
print "removing duplicates"
# a set keeps only one copy of each line
myset = set()
myset.update(all_words)
print "{} elements remaining after duplicate removal".format(len(myset))

print "writing data"
with open(outfile, 'w') as f:
    for line in myset:
        f.write("%s\n" % line)

but then I end up with a ~200MB file (more than 600MB before) with only 19991889 lines (19.9 million). That many duplicates? Weird.

So I made this script to count the number of lines; according to Lazy Method for Reading Big File in Python? it should only load the file into RAM one line at a time:

abs_filename = r"D:\realhuman_phill.txt"
print "counting lines in {}".format(abs_filename)
with open(abs_filename) as infile:
    counter = 0
    for line in infile:
        counter = counter + 1 
print counter

and it returns 19991889 = 19,991,889, the same number, far from 64 million, and this time with no duplicate removal at all.

I'm guessing either Python or my OS does not let me access the rest of the file. Any idea what is going on?
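
One check I haven't actually run yet (just a sketch) would be to compare the file size on disk with how much data a read() actually returns, in both text and binary mode:

import os

abs_filename = r"D:\realhuman_phill.txt"

size_on_disk = os.path.getsize(abs_filename)   # size in bytes, straight from the OS
with open(abs_filename, 'rb') as infile:       # binary mode: no translation by Windows
    bytes_binary = len(infile.read())
with open(abs_filename, 'r') as infile:        # text mode, like the scripts above
    bytes_text = len(infile.read())

print "size on disk      : {}".format(size_on_disk)
print "read in binary    : {}".format(bytes_binary)
print "read in text mode : {}".format(bytes_text)

If the text-mode number comes out much smaller than the other two, the text-mode read is stopping (or translating) before the end of the file.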

Thanks

PS: Windows 8.1 64-bit, Python 2.7 64-bit

sliders_alpha
  • can you post a sample of the file _in the question_? – Jean-François Fabre Dec 05 '17 at 13:18
  • Aren't you running your (rather inefficient) line count check on the input file and not the output file? eg: aren't you supposed to be running it on the filename that results from `outfile = base + "_no_duplicate" + ext` ? – Jon Clements Dec 05 '17 at 13:29
  • @Jean-FrançoisFabre I can give you a sample of my 200MB output file, but not of the original file since I cannot open it – sliders_alpha Dec 05 '17 at 13:34
  • @JonClements no, I want to count the lines of the input file; those 2 scripts are independent and work on the original file – sliders_alpha Dec 05 '17 at 13:35
  • Sorry - I'm confused. Your input file contains the duplicates of which you're then deduplicating and creating a new file without duplicates, then you're comparing the line count of the input file to prove no duplicates have been removed from the output file? – Jon Clements Dec 05 '17 at 13:36
  • @JonClements the file is said to be "64 million lines". First I decided to remove the duplicates (if there were any) and I ended up with a 19 million line file. I found that weird, so I then counted the lines of the original/64M-line file and also ended up with a count of 19 million. But that can't be right, because the 'no duplicate' file is 200MB while the original file is 600MB, so it has to be more than 19 million – sliders_alpha Dec 05 '17 at 13:39

1 Answer


The issue could be with the line endings. Try forcing the file to be read in binary mode:

with open(abs_filename, 'rb') as infile:
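
In context, the counting script from the question would become something like this (a sketch, same logic, just opened with 'rb'):

abs_filename = r"D:\realhuman_phill.txt"
print "counting lines in {}".format(abs_filename)
with open(abs_filename, 'rb') as infile:  # 'rb' disables Windows text-mode translation
    counter = 0
    for line in infile:
        counter = counter + 1
print counter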
Siva Kovvuru
  • I really don't see how it fixed the issue. Unless there _are_ duplicates but with different line endings, and opening as binary makes the lines non-duplicates. – Jean-François Fabre Dec 05 '17 at 19:47