I'm creating a wordlist from the uppercase letters A-Z and the digits 0-9, with every word exactly 8 characters long. Using crunch, which comes preinstalled in Kali, I was able to generate a wordlist in which no character is immediately repeated. For example, 'ABCDEF12' would be generated, but 'AABBCC11' wouldn't, because it contains the same character twice in a row.

The command used: crunch 8 8 ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -d 1

I still need to filter this wordlist down by excluding any word in which the same character occurs more than 2 times. For example, ABCA12AB would be excluded because the letter 'A' occurs 3 times, and I want each character to occur 2 times at most.

There isn't any option within crunch to do this, and I've tried looking up regex to filter the results but I'm very new to regex and couldn't figure it out.

  • Is the purpose to reduce total runtime? If so, have you done the math of how much time / compute this will actually save? In practice, it is not worth the effort. – Royce Williams Apr 04 '23 at 20:15
  • Not to reduce runtime at all, it's still going to be a massive wordlist. It's just the type of wordlist that I need, that's all. –  Apr 04 '23 at 21:17
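
For anyone who wants to do the math raised in the comments, here is a rough sketch in Python. The function names are made up for illustration. It assumes that crunch's output is exactly the set of 8-character strings with no character immediately repeated, as the question describes. The third count covers only the "at most 2 occurrences" rule and ignores the no-adjacent-repeat rule, so treat it as a rough gauge rather than the size of the final list.

```python
from math import comb, factorial

ALPHABET_SIZE = 36  # A-Z plus 0-9, as in the question
LENGTH = 8


def count_all(k, n):
    """Every string of length n over k symbols."""
    return k ** n


def count_no_adjacent_repeat(k, n):
    """Strings where no symbol is immediately repeated.

    The first symbol is free; each later one must differ from its
    predecessor, leaving k - 1 choices.
    """
    return k * (k - 1) ** (n - 1)


def count_max_two_per_symbol(k, n):
    """Strings where no symbol occurs more than twice (adjacency ignored).

    Pick j symbols to appear twice and n - 2*j symbols to appear once,
    then count the distinct arrangements: n! / 2**j.
    """
    total = 0
    for j in range(n // 2 + 1):
        singles = n - 2 * j
        # comb() returns 0 when there are not enough symbols left.
        total += comb(k, j) * comb(k - j, singles) * factorial(n) // (2 ** j)
    return total


if __name__ == "__main__":
    print("all strings:          ", count_all(ALPHABET_SIZE, LENGTH))
    print("no adjacent repeats:  ", count_no_adjacent_repeat(ALPHABET_SIZE, LENGTH))
    print("max two per character:", count_max_two_per_symbol(ALPHABET_SIZE, LENGTH))
```

Comparing the printed totals gives a feel for how little each restriction trims from the full keyspace.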

1 Answer

I'm sure there is a clever way to do this using regular expressions, but here's a quick and dirty way to do it in Python:

filename = '/usr/share/dict/american-english'

def has_at_most_n_occurrences(s, n):
    """Return True if no character appears more than n times in s."""
    counts = {}
    for c in s:
        counts[c] = counts.get(c, 0) + 1
        # Stop as soon as any character exceeds the limit.
        if counts[c] > n:
            return False
    return True

with open(filename) as file:
    for line in file:
        line = line.strip()
        # Print only the words that pass the filter.
        if has_at_most_n_occurrences(line, 2):
            print(line)

Just change the first line of the script to point to the source file containing your wordlist (I used /usr/share/dict/american-english to test), then save the Python script on your system and run it from the command line like so:

python3 /path/to/script.py

It should output only those words in your source file that contain no more than two occurrences of the same character.
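
For completeness, the regular-expression route mentioned at the top of this answer could look something like the sketch below. It is an untuned illustration, not a replacement for the script above: the names `THREE_OR_MORE`, `keep` and `filter_words` are invented here, and the wordlist path is taken from the command line instead of being hard-coded. The pattern `(.).*\1.*\1` captures one character and then requires two more copies of it later in the word, so a match means that character occurs at least 3 times.

```python
import re
import sys

# Matches any word in which some character occurs three or more times:
# capture one character, then find it again twice further along.
THREE_OR_MORE = re.compile(r'(.).*\1.*\1')


def keep(word):
    """Return True if no character occurs more than twice in word."""
    return THREE_OR_MORE.search(word) is None


def filter_words(lines):
    """Yield the stripped, non-empty lines that pass the filter."""
    for line in lines:
        word = line.strip()
        if word and keep(word):
            yield word


# Only runs when a wordlist path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as wordlist:
        for word in filter_words(wordlist):
            print(word)
```

Saved under a name of your choosing (say `filter_regex.py`), it would be run as `python3 filter_regex.py /path/to/wordlist.txt`. On Linux, passing `/dev/stdin` as the path should let you pipe crunch's output straight in without writing the unfiltered list to disk first.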

mti2935
  • Thanks for the script, I appreciate your effort. Unfortunately I'm not looking for all unique characters. The characters can repeat twice, but no more than that. For example, the character 'A' can appear 2 times in the word generated but cannot appear 3 times. Thank you for the quick response anyway! –  Apr 04 '23 at 21:21
  • Sorry, I misunderstood your requirement. I edited my answer with a revised script that should do what you want. For example, from /usr/share/dict/american-english it outputs `hazard`, but not `unmanageable`. – mti2935 Apr 04 '23 at 21:43