
I'm pulling some licensure data and placing it into a list.

rank = ['\r\n\t\t', 'RANK2', 'Rank II', '07', '-', '01', '-', '2016', u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']
cert = ['\r\n\t\t', 'KEL', 'Professional Certificate For Teaching In Elementary School, Primary Through Grade 5', '07', '-', '01', '-', '2016', u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']

I want to strip the whitespace escape sequences and the non-ASCII characters (such as u'\xa0') from my lists, rejoin the date pieces, and ultimately get my lists to look like this:

rank = ['RANK2', 'Rank II', '07-01-2016', '06-30-2021']
cert = ['KEL', 'Professional Certificate For Teaching In Elementary School, Primary Through Grade 5', '07-01-2016', '06-30-2021']

I've looked through some other questions that remove escape sequences from lists, remove unicode, remove non-ascii, and some others but I can't get them to work for my situation.

Some get close but no cigar:

[word for word in cert if word.isalnum()]
>>> ['KEL', '07', '01', '2016', '06', '30', '2021']

def recursive_map(lst, fn):
    return [recursive_map(x, fn) if isinstance(x, list) else fn(x) for x in lst]
recursive_map(rank, lambda x: x.encode("ascii", "ignore"))
>>>['\r\n\t\t', 'RANK2', 'Rank II', '07', '-', '01', '-', '2016', '', '06', '-', '30', '-', '2021', '', '\r\n\t']    

I'm stuck in a rut at the moment...anyone have any ideas?

otteheng
  • How are you getting `rank` and `cert`? If you're scraping the HTML page you're probably better off using `beautifulsoup` or a similar library, which has built-in ways to get you all the text in a table cell. – roeland Sep 08 '16 at 23:55

1 Answer


Here's something quick-n-dirty:

rank = ['\r\n\t\t', 'RANK2', 'Rank II', '07', '-', '01', '-', '2016', u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']
cert = ['\r\n\t\t', 'KEL', 'Professional Certificate For Teaching In Elementary School, Primary Through Grade 5', '07', '-', '01', '-', '2016', u'\xa0', '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']

def clean(L):
    '''Removes non-printable characters, drops empty strings, and rejoins the hyphenated dates.
    '''
    cleaned = [s for s in map(scrubbed, L) if s]
    # scrubbed() strips every non-ASCII character, so '\xa0' cannot occur in
    # cleaned and is safe as a temporary separator for rejoining the dates.
    return '\xa0'.join(cleaned).replace('\xa0-\xa0','-').split('\xa0')

def scrubbed(s):
    '''Removes control and non-ASCII characters.
    '''
    # Printable ASCII runs from 32 (space) to 126 (~); 127 is the DEL control character.
    return ''.join([n for n in s if 32 <= ord(n) <= 126])

print(clean(rank))
print(clean(cert))

Output:

['RANK2', 'Rank II', '07-01-2016', '06-30-2021']
['KEL', 'Professional Certificate For Teaching In Elementary School, Primary Through Grade 5', '07-01-2016', '06-30-2021']
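
If the join/replace/split trick feels too fragile, the same result can be built in a single pass. This is only a sketch, assuming every standalone '-' element is meant to glue the elements on either side of it together; the names `printable_ascii` and `clean_tokens` are made up for illustration and are not part of the code above.

```python
def printable_ascii(s):
    """Keep only printable ASCII characters (space through tilde)."""
    return ''.join(ch for ch in s if 32 <= ord(ch) <= 126)

def clean_tokens(tokens):
    """Drop tokens that scrub down to nothing and merge 'MM', '-', 'DD', '-', 'YYYY' runs."""
    out = []
    glue = False  # True right after a lone hyphen was appended to out[-1]
    for tok in (printable_ascii(t) for t in tokens):
        if not tok:
            continue              # was only whitespace escapes or non-ASCII
        if tok == '-' and out:
            out[-1] += '-'        # attach the hyphen to the previous piece
            glue = True
        elif glue:
            out[-1] += tok        # attach the piece that follows the hyphen
            glue = False
        else:
            out.append(tok)
    return out

rank = ['\r\n\t\t', 'RANK2', 'Rank II', '07', '-', '01', '-', '2016', u'\xa0',
        '06', '-', '30', '-', '2021', u'\xa0', '\r\n\t']
print(clean_tokens(rank))
# ['RANK2', 'Rank II', '07-01-2016', '06-30-2021']
```

A hyphen inside a longer string (for example a hyphenated certificate title) is left alone, because only an element that is exactly '-' triggers the merge.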
Mark Tolonen