-1

I'm trying to ensure that every substring in an expected list occurs in each list of strings, and I need to know if one is missing so I can populate it. I also need to find the index of each substring in the list of strings so I can pull the value of the string next to it. (Using Python 3.)

# List of strings parsed from a document
strings = [['name', 'Joe Sixpack', 'email', 'beerme@thebrew.com'],
           ['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone', 
            '555-555-5550']]
# Expected/desired headings
subs = ['name', 'email', 'phone']

Then check whether all of the 'subs' were captured. If not, find which ones are missing and fill them in with nan.

Expected Results:

{'name': 'Joe Sixpack', 'email': 'beerme@thebrew.com', 'phone': nan}
{'name': 'Winnie Cooler', 'email': 'Winnie Cooler', 'phone': '555-555-5550'}
spearna

4 Answers

1

This question seems to be about how to translate the logical steps required to solve a problem into code. Before even starting with Python, it can be helpful to think in pseudocode to clearly see the logical steps required.

for each row of data:
    * initialize a new output data structure for this row
    for each required key:
        if the key is in the row:
            * find the indices associated with the key/value pair
            * store key/value pair in the output data
        otherwise (i.e. if the key is not in the row):
            * store key/None pair in the output data 

You can almost directly translate this pseudocode into working Python code. This is a very explicit approach using loops and variable declarations for each step of the logic, which is good as a learning exercise. Later on, you might want to optimize this for performance and/or style.

# List of strings parsed from a document
strings = [['name', 'Joe Sixpack', 'email', 'beerme@thebrew.com'],
           ['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone', 
            '555-555-5550']]

# Expected/desired headings
subs = ['name', 'email', 'phone']

# Create dictionaries for each row
results = []  
for row in strings:
    d = {}
    for key in subs:
        if key in row:
            key_idx = row.index(key)
            val_idx = key_idx + 1
            val = row[val_idx]
        else:
            val = None
        d[key] = val
    results.append(d)

print(results)

Results:

[{'name': 'Joe Sixpack', 'email': 'beerme@thebrew.com', 'phone': None}, 
{'name': 'Winnie Cooler', 'email': 'Winnie Cooler', 'phone': '555-555-5550'}]
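
As one possible way to tighten the loop above for style, the same logic can be sketched as a small function built around a dict comprehension. The function name rows_to_dicts and its missing parameter are hypothetical, introduced only for this illustration. Like the loop, it assumes a heading is never the last element of a row.

```python
def rows_to_dicts(rows, keys, missing=None):
    """Return one dict per row, mapping each key to the element after it."""
    return [
        # row.index(key) finds the first occurrence of the heading;
        # the value is assumed to sit immediately after it.
        {key: (row[row.index(key) + 1] if key in row else missing)
         for key in keys}
        for row in rows
    ]


strings = [['name', 'Joe Sixpack', 'email', 'beerme@thebrew.com'],
           ['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone',
            '555-555-5550']]
subs = ['name', 'email', 'phone']

results = rows_to_dicts(strings, subs)
print(results)
```

Passing a different missing value (for example float('nan')) changes only the placeholder, not the lookup logic.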
Eric Miller
0
# List of strings parsed from a document
strings = [['name', 'Joe Sixpack', 'email', 'beerme@thebrew.com'],
           ['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone', 
            '555-555-5550']]
# Expected/desired headings
subs = ['name', 'email', 'phone']

I chose dictionary output for this, built with a comprehension passed to dict.

import numpy as np

for row in strings:
    # Map each heading found in the row to the value that follows it
    foundSubs = dict((s, row[i + 1])
                     for (i, s) in enumerate([n.lower() for n in row])
                     for sub in subs if sub in s)

    # Check for all subs in the result: name, email, phone.
    # If one is missing, fill in nan.
    for eachSub in subs:
        if not [k for k in foundSubs if eachSub in k]:
            foundSubs[eachSub] = np.nan

    print(foundSubs)

Results:

{'name': 'Joe Sixpack', 'email': 'beerme@thebrew.com', 'phone': nan}
{'name': 'Winnie Cooler', 'email': 'Winnie Cooler', 'phone': '555-555-5550'}

To get a list of (heading, value) tuples instead, build a list rather than calling dict on the comprehension:

[('name', 'Joe Sixpack'), ('email', 'beerme@thebrew.com'), ('phone', nan)]
[('name', 'Winnie Cooler'), ('email', 'Winnie Cooler'), ('phone', '555-555-5550')]
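
The tuple variant is not shown as code above, so here is a minimal sketch of how it might look. It uses an exact match on the lowercased heading rather than the substring test, and float('nan') in place of np.nan to avoid the NumPy import. Both are simplifications assumed for this example, and the names found, present and all_pairs are hypothetical.

```python
strings = [['name', 'Joe Sixpack', 'email', 'beerme@thebrew.com'],
           ['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone',
            '555-555-5550']]
subs = ['name', 'email', 'phone']

all_pairs = []
for row in strings:
    # Pair each recognised heading with the element that follows it.
    # row[:-1] skips the last element, which has nothing after it.
    found = [(s.lower(), row[i + 1]) for i, s in enumerate(row[:-1])
             if s.lower() in subs]
    # Append (heading, nan) for every expected heading that was not found
    present = {heading for heading, _ in found}
    found += [(sub, float('nan')) for sub in subs if sub not in present]
    all_pairs.append(found)
    print(found)
```

A list cannot be filled in with foundSubs[eachSub] = ..., which is why the missing headings are appended as tuples here.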
spearna
0

Convert each row to a set and take the set difference to find the missing headings. For each missing heading, append the heading followed by None to the row.

# List of strings parsed from a document
data = [['name', 'Joe Sixpack', 'email', 'Winnie Cooler'],
        ['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone',
         '555-555-5550']]
# Expected/desired headings
subs = {'name', 'email', 'phone'}

for node in data:
    # Headings that do not appear anywhere in this row
    missingValue = subs.difference(node)
    # Append each missing heading followed by None
    # (set iteration order is arbitrary if several are missing)
    for value in missingValue:
        node.append(value)
        node.append(None)
    print(node)

Output:

['name', 'Joe Sixpack', 'email', 'Winnie Cooler', 'phone', None]
['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone', '555-555-5550']
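
One caveat with comparing against the whole row: the row contains values as well as headings, so a value that happens to equal a heading would hide a missing heading. A hedged variant is sketched below, assuming headings always sit at even positions. It also leaves the input rows unmodified and walks an ordered list so the appended order is deterministic. The names completed, present and filled are hypothetical.

```python
data = [['name', 'Joe Sixpack', 'email', 'Winnie Cooler'],
        ['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone',
         '555-555-5550']]
subs = ['name', 'email', 'phone']

completed = []
for node in data:
    present = set(node[::2])   # headings only (even positions)
    filled = list(node)        # copy, so the input row is not modified
    for heading in subs:
        if heading not in present:
            filled += [heading, None]
    completed.append(filled)
    print(filled)
```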
Venkata
0

A one-liner:

>>> strings = [['name', 'Joe Sixpack', 'email', 'beerme@thebrew.com'],
...            ['name', 'Winnie Cooler', 'email', 'Winnie Cooler', 'phone', 
...             '555-555-5550']]
>>> subs = ['name', 'email', 'phone']
>>> [{**{k: None for k in subs}, **dict(zip(s[::2], s[1::2]))} for s in strings]
[{'name': 'Joe Sixpack', 'email': 'beerme@thebrew.com', 'phone': None}, {'name': 'Winnie Cooler', 'email': 'Winnie Cooler', 'phone': '555-555-5550'}]

Note: None is better than nan for a phone number, because nan is a float while a phone number is a string.

The heart of the list comprehension is dict(zip(s[::2], s[1::2])). s[::2] creates a list of the elements of s at even indices (the headings), and s[1::2] a list of the elements at odd indices (the values). Both are zipped into an iterable of pairs (even, odd), (even, odd), ..., that is ('name', 'Joe Sixpack'), ('email', 'beerme@thebrew.com') for the first row. These pairs are turned into a dictionary with dict.
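
To make the slicing step concrete, here is a small sketch that breaks the expression into its parts for the first row. The intermediate names headings, values, pairs and record are hypothetical and exist only for this illustration.

```python
s = ['name', 'Joe Sixpack', 'email', 'beerme@thebrew.com']

headings = s[::2]    # elements at indices 0, 2, ...
values = s[1::2]     # elements at indices 1, 3, ...

# zip pairs the i-th heading with the i-th value
pairs = list(zip(headings, values))

# dict turns the (heading, value) pairs into a mapping
record = dict(pairs)

print(headings)
print(values)
print(pairs)
print(record)
```

This relies on the row strictly alternating heading, value, heading, value; an extra or missing element shifts every pair after it.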

Now the default values. {k: None for k in subs} is the dictionary {'name': None, 'email': None, 'phone': None}. Both dictionaries are merged (see How to merge two dictionaries in a single expression?): values for duplicate keys are taken from the second one, so the parsed values override the None defaults, and voilà.
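
The order of the two operands matters in a {**a, **b} merge, because entries unpacked later overwrite earlier ones with the same key. A short sketch of both orders follows (this unpacking syntax needs Python 3.5 or later; the names defaults, parsed, merged and swapped are hypothetical):

```python
defaults = {k: None for k in ['name', 'email', 'phone']}
parsed = {'name': 'Joe Sixpack', 'email': 'beerme@thebrew.com'}

# Defaults first, parsed second: parsed values win, 'phone' keeps None
merged = {**defaults, **parsed}

# Parsed first, defaults second: every parsed value is overwritten by None
swapped = {**parsed, **defaults}

print(merged)
print(swapped)
```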

jferard