Iterate over list of strings to pull out substrings

Question

I have a long list of different strings that all contain some information about a specific port across the globe. However, each port name is different and is contained in a different location within the string. What I want to do is loop over all of the strings, find the word 'Port' and then store the next two substrings after 'Port'. For example:

'Strong winds may disrupt operations at the Port of Rotterdam on July 5'

I find 'Port' and now want 'of Rotterdam' to be added onto 'Port' as a complete string, like 'Port of Rotterdam'. I thought there could be some way to split up each longer string by doing parts = my_str.split(' '). Then:

for i in parts:
    if i == 'Port':
        new_str = i

However, I am not sure how to add on the next two substrings. Ideas?

No, not necessarily. But I think most of the time it should be. I'm not sure how I would control for situations where it isn't though. — Eli Turasky, Jan 08 '21 at 18:22
You could also match Port followed by a single "word" and optionally a second for example `\bPort\s+\S+(?:\s+\S+)?` https://regex101.com/r/OatSlU/1 — The fourth bird, Jan 08 '21 at 18:26
I've updated my answer further, with a better regex solution as well. — Mad Physicist, Jan 08 '21 at 19:02

Mad Physicist · Accepted Answer · 2021-01-08T20:03:23.680

Take a look at list.index, which returns the position of the first matching item and raises ValueError if there is none:

parts = my_str.split(' ')
try:
    port_index = parts.index('Port')
except ValueError:
    pass # Port name not found
else:
    port_name = ' '.join(parts[port_index:port_index + 3])  # 'Port' plus the next two words

You can of course do more advanced processing. For example, grab a sequence of capitalized words, optionally preceded by a single 'of':

def find_name(sentence):
    """
    Get the port name or None.
    """
    parts = sentence.split(' ')
    try:
        start = parts.index('Port')
    except ValueError:
        return None
    else:
        if start == len(parts) - 1:
            return None

    end = start + 1
    if parts[end] == 'of':
        end = end + 1
    while end < len(parts) and parts[end][0].isupper():
        end += 1

    if end == start + 1 or (end == start + 2 and parts[start + 1] == 'of'):
        return None

    return ' '.join(parts[start:end])

Of course you can do the same thing with regex:

import re

pattern = re.compile(r'Port(?:\s+of)?(\s+[A-Z]\S+)+')
match = pattern.search(my_str)
if match:  # search() returns None when nothing matches
    print(match.group())

This regex only recognizes the ASCII uppercase letters A-Z, so a name that starts with an accented or non-Latin capital will not be matched. You may want to investigate the solutions here for such port names.
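One way around the A-Z limitation, without third-party packages, is to let the regex locate 'Port' and then test each following word with str.isupper(), which is Unicode-aware. The sketch below is only an illustration: the function name find_port_unicode and the sample sentence are made up for this example and are not part of the original answer.

```python
import re

def find_port_unicode(sentence):
    """Return the port name or None (hypothetical helper for illustration)."""
    # Try every standalone occurrence of 'Port' in the sentence.
    for match in re.finditer(r'\bPort\b', sentence):
        words = sentence[match.end():].split()
        name = ['Port']
        # Keep a single optional 'of' directly after 'Port'.
        if words and words[0] == 'of':
            name.append(words.pop(0))
        # Collect consecutive words whose first character is uppercase.
        capitalized = []
        for word in words:
            if not word[0].isupper():
                break
            capitalized.append(word)
        if capitalized:
            return ' '.join(name + capitalized)
    return None

# Made-up sample sentence with a non-ASCII capital letter.
sample = 'Delays are expected at the Port of Ålesund on Monday'
print(find_port_unicode(sample))
```

The ASCII-only pattern above finds nothing in this sample sentence, because 'Å' is outside the range A-Z.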

Both of the solutions here will work correctly for the following three test cases:

'Strong winds may disrupt operations at the Port of Rotterdam on July 5'
'Strong winds may disrupt operations at the Port of Fos-sur-Mer on July 5'
'Strong winds may disrupt operations at Port Said on July 5'
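Since the question is about a whole list of strings, here is a minimal sketch of running the regex over these three sentences. The helper name extract_port is an assumption introduced for this example; it returns None instead of raising when a sentence has no match.

```python
import re

pattern = re.compile(r'Port(?:\s+of)?(\s+[A-Z]\S+)+')

sentences = [
    'Strong winds may disrupt operations at the Port of Rotterdam on July 5',
    'Strong winds may disrupt operations at the Port of Fos-sur-Mer on July 5',
    'Strong winds may disrupt operations at Port Said on July 5',
]

def extract_port(sentence):
    """Return the matched port name, or None if the pattern is not found."""
    match = pattern.search(sentence)
    return match.group() if match else None

port_names = [extract_port(s) for s in sentences]
print(port_names)
# ['Port of Rotterdam', 'Port of Fos-sur-Mer', 'Port Said']
```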

You can likely improve the search further, but this should give you the tools to get a solid start. At some point, if the sentences become complex enough, you may want to use natural language processing of some kind. For example, look into the nltk package.

Your last regex solution is really neat. Is there a way to amend it so the name of the port can include dashes? For example, Port of Fos-sur-Mer. — Eli Turasky, Jan 08 '21 at 19:09
@EliTurasky. Sure. I've edited the answer. You can add builtin classes into manually constructed classes: `[\w-]` is totally valid. I've chosen to be even less restrictive and just changed `\w` to `\S`. — Mad Physicist, Jan 08 '21 at 20:03

score 1 · Answer 2 · answered Jan 08 '21 at 18:28

You can use a list comprehension to get 'Port' plus the next two tokens:

l = 'Strong winds may disrupt operations at the Port of Rotterdam on July 5 and Port of London is closed tomorrow'

tokens = l.split()
ports = [' '.join(tokens[i:i+3]) for i in range(len(tokens)) if tokens[i]=='Port']
print(ports)

['Port of Rotterdam', 'Port of London']

The benefit of this approach is that it can find multiple Ports in the same sentence.

score 1 · Answer 3 · answered Jan 08 '21 at 18:31


Another option is to use a pattern that matches Port followed by a "word" consisting of non-whitespace characters and, optionally, a second word in case the second one is not always present.

\bPort\s+\S+(?:\s+\S+)?

\bPort Match Port preceded by a word boundary
\s+\S+ Match 1+ whitespace characters and 1+ non-whitespace characters
(?:\s+\S+)? Optionally match a second word

Regex demo

Example code

import re
pattern = r"\bPort\s+\S+(?:\s+\S+)?"
s = "Strong winds may disrupt operations at the Port of Rotterdam on July 5"
print(re.findall(pattern, s))

Output

['Port of Rotterdam']
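To apply the same pattern to a list of strings, something like the following sketch could work. The list name strings and the third sentence are made up for illustration. Note that the pattern always takes up to two words after Port, so a one-word name picks up the word that follows it.

```python
import re

pattern = re.compile(r"\bPort\s+\S+(?:\s+\S+)?")

strings = [
    "Strong winds may disrupt operations at the Port of Rotterdam on July 5",
    "Strong winds may disrupt operations at Port Said on July 5",
    "No harbour is mentioned in this one",  # made-up sentence without a match
]

# findall returns a (possibly empty) list per string; flatten the results.
ports = [name for s in strings for name in pattern.findall(s)]
print(ports)
# ['Port of Rotterdam', 'Port Said on']
```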

answered Jan 08 '21 at 18:31

The fourth bird


This won't work for a sentence like `"Strong winds may disrupt operations at Port Said on July 5"`. Not a requirement by OP, but still a reasonable nice to have. – Mad Physicist Jan 08 '21 at 18:53
@MadPhysicist Why would it not work? https://regex101.com/r/tQ1Lcm/1 The question states `store the next two substrings after 'Port. ` so it does work. – The fourth bird Jan 08 '21 at 19:11

score 1 · Answer 4 · answered Jan 08 '21 at 18:33

.split() creates a list in which every item is one word of the string. Then iterate through the list to find the position of "Port". If "Port" is found, a new string is created.

parts = 'Strong winds may disrupt operations at the Port of Rotterdam on July 5'
words = parts.split()
new_str = None

for i, word in enumerate(words):
    if word == "Port":
        new_str = " ".join(words[i:i + 3])  # "Port" plus the next two words; slicing cannot raise IndexError

if new_str:
    print(new_str)
