You can use a regex that captures the first word after 'symptoms'
and then optionally matches more words, each introduced by a comma, optional whitespace and more word characters:
import re
pattern = r"symptoms\s+(\w+)(?:,\s*(\w+))*"
regex = re.compile(pattern)
t = "kathy has symptoms cold,cough her gender is female. john's symptoms hunger, thirst."
symptoms = regex.findall(t)
print(symptoms)
Output:
[('cold', 'cough'), ('hunger', 'thirst')]
Explanation:
r"symptoms\s+(\w+)(?:,\s*(\w+))*"
# symptoms\s+     literal "symptoms" followed by 1+ whitespace characters
# (\w+)           followed by 1+ word characters (first symptom) as group 1
# (?:,\s* ... )*  non-capturing group, repeated 0+ times: comma plus optional whitespace
# (\w+)           1+ word characters (next symptom) as group 2; a repeated group keeps only its last match
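A repeated capturing group keeps only its last match, so this pattern returns at most two symptoms per sentence. A short check, assuming test strings made up for illustration, shows what findall returns with three symptoms and with a single one:

```python
import re

pattern = r"symptoms\s+(\w+)(?:,\s*(\w+))*"

# Hypothetical input with three symptoms instead of two.
three = re.findall(pattern, "kathy has symptoms cold,cough,fever")
# Hypothetical input with a single symptom.
one = re.findall(pattern, "john has symptoms hunger")

print(three)  # [('cold', 'fever')]  -> 'cough' is lost
print(one)    # [('hunger', '')]     -> group 2 did not take part in the match
```

With more than two symptoms per sentence, capturing the whole list in one group is therefore the safer choice.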
Alternate way:
import re
pattern = r"symptoms\s+(\w+(?:,\s*\w+)*(?:\s+and\s+\w+)?)"
regex = re.compile(pattern)
t1 = "kathy has symptoms cold,cough,fever and noseitch her gender is female. "
t2 = "john's symptoms hunger, thirst."
symptoms = regex.findall(t1+t2)
print(symptoms)
Output:
['cold,cough,fever and noseitch', 'hunger, thirst']
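This works only when there is no comma before "and" (the usual "British" style). The "American" style with a serial comma,
"kathy has symptoms cold,cough,fever, and noseitch"
will only yield cold,cough,fever, and
as the match, because ", and" is consumed as one more comma-separated word and " noseitch" is left over.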
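If both styles have to be handled, the following is a minimal sketch of one possible variant, assuming the optional comma sits directly before the final "and". A negative lookahead keeps the comma-separated part from taking "and" as a symptom, and the trailing group accepts an optional comma. The name pattern_serial and the extra test sentence are made up for this example:

```python
import re

# Hypothetical variant of the pattern above:
# (?!and\b) stops the comma-separated part from taking "and" as a symptom,
# ,?        allows an optional comma before the final "and".
pattern_serial = r"symptoms\s+(\w+(?:,\s*(?!and\b)\w+)*(?:,?\s+and\s+\w+)?)"

text = ("kathy has symptoms cold,cough,fever, and noseitch her gender is female. "
        "mary has symptoms cold,cough and fever. "
        "john's symptoms hunger, thirst.")
matches = re.findall(pattern_serial, text)
print(matches)
# ['cold,cough,fever, and noseitch', 'cold,cough and fever', 'hunger, thirst']
```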
You can split each individual match at ','
and " and "
to get the individual symptoms:
sym = [ inner.split(",") for inner in (x.replace(" and ",",") for x in symptoms)]
print(sym)
Output:
[['cold', 'cough', 'fever', 'noseitch'], ['hunger', ' thirst']]
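The leading space in ' thirst' is left over because the split happens at the bare comma. As a minimal sketch of an alternative, re.split can consume the surrounding whitespace and the " and " separator in one step. The name splitter is made up for this example, and the input list is the findall result from above:

```python
import re

symptoms = ['cold,cough,fever and noseitch', 'hunger, thirst']

# Split at a comma (with surrounding whitespace and an optional following "and")
# or at a standalone " and ".
splitter = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")
sym = [splitter.split(s) for s in symptoms]
print(sym)
# [['cold', 'cough', 'fever', 'noseitch'], ['hunger', 'thirst']]
```

The same splitter also handles a match that contains ", and ", since the comma alternative takes the following "and" with it.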