
I'm trying to remove social security numbers (SSNs) for GDPR compliance reasons from messy data generated by speech-to-text. Here is a sample string (translated to English, which explains why 'and' occurs where the SSN is listed):

sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"

My goal is to remove the part "thirteen ... forty" while keeping other numbers that may appear in the string, resulting in:

sample1_wo_ssn = "hello my name is sofie my social security number is and I live on mountain street number twelve"

The length of the social security number can vary as a consequence of how the data is generated (3-10 separate numbers).

My approach:

  1. Replace written numbers with digits using a dict
  2. Use a regex to find runs of 3 or more numbers separated only by whitespace or "and", and remove the whole run, including any numbers that follow the first 3.

Here is my code:

import re

number_dict = {
    'zero': '0',
    'one': '1',
    'two': '2',
    'three': '3',
    'four': '4',
    'five': '5',
    'six': '6',
    'seven': '7',
    'eight': '8',
    'nine': '9',
    'ten': '10',
    'eleven': '11',
    'twelve': '12',
    'thirteen': '13',
    'fourteen': '14',
    'fifteen': '15',
    'sixteen': '16',
    'seventeen': '17',
    'eighteen': '18',
    'nineteen': '19',
    'twenty': '20',
    'thirty': '30',
    'forty': '40',
    'fifty': '50',
    'sixty': '60',
    'seventy': '70',
    'eighty': '80',
    'ninety': '90'
}


sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
sample1_temp = [number_dict.get(item,item)  for item in sample1.split()]
sample1_numb = ' '.join(sample1_temp)
re_results = re.findall(r'(\d+ (and\s)?\d+ (and\s)?\d+\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?)', sample1_numb) 

print(re_results)

Output:

[('13 0 4 5 and 70 18 7 and 40 and ', '', '', '', '5', 'and ', '70', '', '18', '', '7', 'and ', '40', 'and ', '', '', '', '', '')]

This is where I'm stuck.

In this example I could do something like sample1_wo_ssn = re.sub(re_results[0][0], '', sample1_numb) to get the desired result, but this will not generalize.

Any help would be greatly appreciated.

Wiktor Stribiżew
Avibel
  • It seems you only want to "support" numbers from `1` to `99` inclusive, right? Also, would the `hello my name is sofie my social security number is and I live on mountain street number 12` result suffice? Or do you want to revert numbers to word numerals? – Wiktor Stribiżew Apr 02 '20 at 13:15
  • @WiktorStribiżew Yes, numbers from `1` to `99` will be sufficient. It will not be perfect, as people can of course state their SSN using larger numbers. Also, it would be best for me to revert numbers to word numerals. – Avibel Apr 02 '20 at 18:48

1 Answer


Here is an implementation of your current logic, namely:

  • Convert number words from zero through ninety-nine into digits
  • Remove every run of 3 or more numbers separated by whitespace, optionally with "and" between them
  • Convert the remaining one- and two-digit numbers back to words.
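The second step carries most of the logic, so it may help to look at it in isolation. Below is a minimal sketch that applies the same pattern as `main_rx` to text that already contains digits; the two sample strings (`ssn_like`, `short_run`) are made up for illustration.

```python
import re

# Same pattern as main_rx in the answer: one number, followed by at least
# two more numbers, each preceded by whitespace and an optional "and".
main_rx = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')

# A run of nine numbers: the whole run is removed, the trailing "and" stays.
ssn_like = "number is 13 0 4 5 and 70 18 7 and 40 and I live at number 12"

# Numbers separated by other words never form a run of three, so nothing is removed.
short_run = "I have 2 cats and 3 dogs"

print(main_rx.sub('', ssn_like))   # number is and I live at number 12
print(main_rx.sub('', short_run))  # I have 2 cats and 3 dogs
```

The leading `\s*` removes the whitespace in front of the run as well, which is why no double space is left behind.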

Credits: the helper snippets are adapted from the Stack Overflow answers linked in the code comments.

See Python code:

import re

number_words = [ "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
number_words_tens =[ "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety" ]
number_words_rx = re.compile(r'\b(?:(?:{0})?(?:{1})|(?:{0}))\b'.format("|".join(number_words_tens),"|".join(number_words)))
main_rx = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')
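# Lookup table for converting digits back to words: index n holds the word for n (0-99).
numbers_1_99 = list(number_words)  # copy, so number_words itself is not mutated
numbers_1_99.extend(tens if ones == "zero" else (tens + "-" + ones) # stackoverflow.com/a/8982279/3832970
    for tens in "twenty thirty forty fifty sixty seventy eighty ninety".split()
    for ones in numbers_1_99[0:10])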

def text2int(textnum, numwords=None): # stackoverflow.com/a/493788/3832970
    """Convert a string of number words (zero through ninety-nine) to an int."""
    if numwords is None:  # avoid sharing a mutable default between calls
        numwords = {}
    units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
    ]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    # Each entry maps a word to (scale, increment); only the increment is used here.
    numwords["and"] = (1, 0)
    for idx, word in enumerate(units):
        numwords[word] = (1, idx)
    for idx, word in enumerate(tens):
        numwords[word] = (1, idx * 10)
    total = 0
    for word in textnum.split():
        if word not in numwords:
            raise ValueError("Illegal word: " + word)
        total += numwords[word][1]
    return total

sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
# Step 1: replace number words with digits
sample1 = number_words_rx.sub(lambda x: str(text2int(x.group())), sample1)
# Step 2: remove runs of 3 or more numbers (the SSN)
re_results = main_rx.sub('', sample1)
# Step 3: convert the remaining numbers back to words
print( re.sub(r'\d{1,2}', lambda x: numbers_1_99[int(x.group())], re_results) )

Output: hello my name is sofie my social security number is and I live on mountain street number twelve
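To reuse this on many transcripts, the three steps can be wrapped in one function. The following is only a sketch under stated assumptions, not part of the original answer: the names `remove_ssn`, `UNITS`, `TENS`, `WORD_VALUE`, `NUMBERS_0_99`, `WORD_RX` and `RUN_RX` are hypothetical, each number word is converted on its own (so "twenty one" becomes "20 1"), and only standalone one- or two-digit tokens are converted back to words, on the assumption that longer digit strings already present in the text should be left alone.

```python
import re

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

# Word -> value for every single number word (zero..nineteen, twenty..ninety).
WORD_VALUE = {word: idx for idx, word in enumerate(UNITS)}
WORD_VALUE.update({word: (idx + 2) * 10 for idx, word in enumerate(TENS)})

# Value -> word for 0..99; index n holds the word for n.
NUMBERS_0_99 = list(UNITS) + [
    t if i == 0 else f"{t}-{UNITS[i]}" for t in TENS for i in range(10)
]

# Whole-word match on any number word, longest alternatives first.
WORD_RX = re.compile(r'\b(?:%s)\b' % "|".join(sorted(WORD_VALUE, key=len, reverse=True)))
# Runs of 3+ numbers separated by whitespace and an optional "and".
RUN_RX = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')


def remove_ssn(text):
    """Remove runs of 3+ spoken numbers, keeping other numbers as words."""
    # Step 1: number words -> digits
    digits = WORD_RX.sub(lambda m: str(WORD_VALUE[m.group()]), text)
    # Step 2: drop every run of three or more numbers
    cleaned = RUN_RX.sub('', digits)
    # Step 3: standalone 1-2 digit numbers -> words
    return re.sub(r'\b\d{1,2}\b', lambda m: NUMBERS_0_99[int(m.group())], cleaned)


sample1 = ("hello my name is sofie my social security number is thirteen zero "
           "four five and seventy eighteen seven and forty and I live on "
           "mountain street number twelve")
print(remove_ssn(sample1))
print(remove_ssn("I have two cats and three dogs"))
```

Like the original approach, this removes any run of three or more spoken numbers, so phone numbers or dates dictated as separate numbers would be dropped as well.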

Wiktor Stribiżew