I know there are other answers, but I like mine better (hahaha).
Anyway, I parsed the original string as if it had no \n (newline) characters.
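If your real input does contain line breaks, one way to get it into this single-line form is to collapse all the whitespace first. This is only a minimal sketch: the `raw` variable and its one-record-per-line layout are my assumptions, not something taken from your data.

```python
# hypothetical multi-line input, one record per line (an assumption)
raw = """1234 dog list some words
1432 cat line 2
1789 cat line3"""

# str.split() with no arguments splits on any run of whitespace,
# including "\n", so joining with a single space flattens the text
txt = " ".join(raw.split())
print(txt)
```

After this, `txt` can be fed to the regular expression exactly as in the rest of the answer.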
To get the animals and the sentences, I used regular expressions:
import re
# original string with no new line characters
txt = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""
# use findall to capture the groups
groups = re.findall(r"(?=(\d{4} (\w+) .*?(?=\d{4}|$)))", txt)
At this point, groups is a list of tuples:
>>> groups
[('1234 dog list some words ', 'dog'),
 ('1432 cat line 2 ', 'cat'),
 ('1789 cat line3 ', 'cat'),
 ('1348 dog line 4 ', 'dog'),
 ('1678 dog line 5 ', 'dog'),
 ('1733 fish line 6 ', 'fish'),
 ('1093 cat more words', 'cat')]
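In case the pattern looks cryptic, here is a commented variant on a shorter sample string. It is a sketch, not a drop-in replacement: I dropped the outer lookahead (it should not be needed when the records do not overlap) and added a `strip()` to remove the trailing spaces you can see above. It assumes a four-digit number never appears inside a sentence; if one does, the sentence would be cut at that number.

```python
import re

# shortened sample, same shape as the original string
txt = "1234 dog list some words 1432 cat line 2 1789 cat line3"

# \d{4}        four-digit id that starts a record
# (\w+)        the animal name (second capture group)
# .*?          the rest of the sentence, as short as possible
# (?=\d{4}|$)  stop right before the next four-digit id or the end of the string
pattern = re.compile(r"(\d{4} (\w+) .*?)(?=\d{4}|$)")

# findall returns one (sentence, animal) tuple per record
groups = [(sentence.strip(), animal) for sentence, animal in pattern.findall(txt)]
print(groups)
```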
Next, I wanted to group all the sentences that refer to the same animal, so I used a hash table (a.k.a. a dictionary, in Python):
# create a dictionary to store the formatted data
dct = {}
for group in groups:
    if group[1] in dct:
        dct[group[1]].append(group[0])
    else:
        dct[group[1]] = [group[0]]
The dct dictionary looks like this:
>>> dct
{'dog': ['1234 dog list some words ', '1348 dog line 4 ', '1678 dog line 5 '],
 'cat': ['1432 cat line 2 ', '1789 cat line3 ', '1093 cat more words'],
 'fish': ['1733 fish line 6 ']}
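As a side note, the if/else bookkeeping can be avoided with `collections.defaultdict` from the standard library. This is just an alternative sketch of the same grouping step, shown on a shortened list of tuples so it runs on its own:

```python
from collections import defaultdict

# a few (sentence, animal) tuples in the same shape as groups
groups = [
    ("1234 dog list some words ", "dog"),
    ("1432 cat line 2 ", "cat"),
    ("1348 dog line 4 ", "dog"),
]

# missing keys start out as an empty list, so no "in" check is needed
dct = defaultdict(list)
for sentence, animal in groups:
    dct[animal].append(sentence)

print(dict(dct))
```

Since Python 3.7 dictionaries keep insertion order, so the animals come out in the order they first appear in the string, either way.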
Finally, we just have to print it in the format you want:
# then print the result in the format you like
for key, value in dct.items():
    print(key)
    for sentence in value:
        print(sentence)
    print()
And the output is:
dog
1234 dog list some words
1348 dog line 4
1678 dog line 5

cat
1432 cat line 2
1789 cat line3
1093 cat more words

fish
1733 fish line 6
The final code is the following:
import re
# original string with no new line characters
txt = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""
# use findall to capture the groups
groups = re.findall(r"(?=(\d{4} (\w+) .*?(?=\d{4}|$)))", txt)
# create a dictionary to store the formatted data
dct = {}
for group in groups:
    if group[1] in dct:
        dct[group[1]].append(group[0])
    else:
        dct[group[1]] = [group[0]]
# then print the result in the format you like
for key, value in dct.items():
    print(key)
    for sentence in value:
        print(sentence)
    print()