0

I have a multiline string in Python that looks like this:

"""1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""

I want to group the lines by animal, so that my output looks like this:

dog
1234 dog list some words 
1348 dog line 4
1678 dog line 5

cat
1432 cat line 2 
1789 cat line3 
1093 cat more words

fish
1733 fish line 6

So far I know that I need to split the text into lines:

def parser(txt):
    for line in txt.splitlines():
        print(line)

But I'm not sure how to continue. How would I group each line with an animal?

stillearning
  • Do you know the names of the animals you want to group by beforehand? And do all lines start with a 4-digit number? – slider Oct 03 '19 at 20:14

3 Answers

1

You could use defaultdict and split each line:

from collections import defaultdict

txt = """123 dog foo
456 cat bar
1234 dog list some words
1348 dog line 4
1432 cat line 2 
1789 cat line3 
1093 cat more words
1678 dog line 5
"""


def parser(txt):
    result = defaultdict(list)
    for line in txt.splitlines():
        num, animal, _ = line.split(' ', 2)  # split on the first 2 spaces only; the rest of the line stays together
        result[animal].append(line)  # store the whole line under its animal
    return result

result = parser(txt)
for animal, lines in result.items():
    print('>>> %s' % animal)
    for line in lines:
        print(line)
    print("")

Output:

>>> dog
123 dog foo
1234 dog list some words
1348 dog line 4
1678 dog line 5

>>> cat
456 cat bar
1432 cat line 2 
1789 cat line3 
1093 cat more words
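
As a side note, here is a minimal sketch of what split(' ', 2) does to a single line, plus a variant of the parser that skips blank or too-short lines instead of raising ValueError on them. The name group_by_animal and the skip-on-malformed behaviour are my own assumptions, not part of the code above.

```python
from collections import defaultdict

# maxsplit=2 cuts at the first two spaces only; the tail stays in one piece
first_line_parts = "1234 dog list some words".split(" ", 2)
print(first_line_parts)


def group_by_animal(txt):
    """Group lines by their second word (hypothetical helper name)."""
    result = defaultdict(list)
    for line in txt.splitlines():
        parts = line.split(None, 2)  # None splits on any run of whitespace
        if len(parts) < 2:
            continue  # assumption: silently skip blank or one-word lines
        result[parts[1]].append(line)
    return result


sample = "1234 dog list some words\n\n1432 cat line 2\n1348 dog line 4"
grouped = group_by_animal(sample)
print(dict(grouped))
```

The blank line in the sample is skipped, so only the three real lines end up in the result.
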
Maurice Meyer
  • What does `'>>> %s'` do? – matkv Oct 03 '19 at 20:25
  • Old school string formatting: https://stackoverflow.com/questions/997797/what-does-s-mean-in-a-python-format-string – Maurice Meyer Oct 03 '19 at 20:27
  • It is used to print the header for each animal and make it stand out. For example `>>> dog`. The `%s` is the string replacement. You could change the `>>>` to `---` or remove it altogether. – sa_leinad Aug 02 '22 at 03:13
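
To make the comments above concrete, here is a small sketch comparing the %s header with two newer spellings that should produce the same string. The variable names are only for illustration.

```python
animal = "dog"

old_style = ">>> %s" % animal          # printf-style: %s is replaced by str(animal)
with_format = ">>> {}".format(animal)  # str.format equivalent
with_fstring = f">>> {animal}"         # f-string equivalent (Python 3.6+)

print(old_style)
print(old_style == with_format == with_fstring)
```
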
1
str1 = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""

animals = ["dog", "cat", "fish"]
tmp = {}            # animal -> list of lines, each line stored as a list of words
tmp1 = []           # words of the line currently being collected
currentAnimal = ""
listOfWords = str1.split()
for index, word in enumerate(listOfWords):
    if word in animals:
        if currentAnimal:
            # the last word collected is the number that starts the new line, so drop it
            tmp1.pop()
            # store the finished line under the animal it belongs to
            tmp.setdefault(currentAnimal, []).append(tmp1)
        currentAnimal = word
        # start the new line with its number (the previous word) and its animal
        tmp1 = [listOfWords[index - 1], word]
    else:
        tmp1.append(word)

# the last line is not followed by another animal, so store it here
if currentAnimal:
    tmp.setdefault(currentAnimal, []).append(tmp1)

for eachKey in tmp:
    print(eachKey)
    for eachItem in tmp[eachKey]:
        print(" ".join(eachItem))

OUTPUT:

dog
1234 dog list some words
1348 dog line 4
1678 dog line 5
cat
1432 cat line 2
1789 cat line3
1093 cat more words
fish
1733 fish line 6
1

I know there are other answers, but I like mine better (hahaha).

Anyway, I parsed the original string as if it had no \n (newline) characters.

To get the animals and the sentences, I used regular expressions:

import re

# original string with no new line characters
txt = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""

# use findall to capture the groups
groups = re.findall(r"(?=(\d{4} (\w+) .*?(?=\d{4}|$)))", txt)

At this point, I get a list of tuples in groups:

>>> groups
[('1234 dog list some words ', 'dog'),
 ('1432 cat line 2 ', 'cat'),
 ('1789 cat line3 ', 'cat'),
 ('1348 dog line 4 ', 'dog'),
 ('1678 dog line 5 ', 'dog'),
 ('1733 fish line 6 ', 'fish'),
 ('1093 cat more words', 'cat')]
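
If the pattern looks cryptic, here is a rough sketch of the same idea spelled out piece by piece with re.VERBOSE and finditer. I dropped the outer lookahead, which I believe is not needed here because the records don't overlap, and I capture the number, the animal and the rest separately. It assumes, like the original pattern, that no other 4-digit number appears inside the words of a record.

```python
import re

txt = "1234 dog list some words 1432 cat line 2 1789 cat line3"

# In re.VERBOSE mode, whitespace inside the pattern is ignored,
# so a literal space is written as a character class: [ ]
pattern = re.compile(r"""
    (\d{4})        # the 4-digit number that starts a record
    [ ]
    (\w+)          # the animal
    [ ]
    (.*?)          # the rest of the record, as short as possible
    (?=\d{4}|$)    # stop right before the next 4-digit number, or at the end
""", re.VERBOSE)

# each match yields a (number, animal, rest) tuple
records = [m.groups() for m in pattern.finditer(txt)]
for record in records:
    print(record)
```
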

Next, I group all the sentences that refer to the same animal. For that I use a hash table (a dictionary, in Python):

# create a dictionary to store the formatted data
dct = {}
for group in groups:
    if group[1] in dct:
        dct[group[1]].append(group[0])
    else:
        dct[group[1]] = [group[0]]

The dct dictionary looks like this:

>>> dct
{'dog': ['1234 dog list some words ', '1348 dog line 4 ', '1678 dog line 5 '],
 'cat': ['1432 cat line 2 ', '1789 cat line3 ', '1093 cat more words'],
 'fish': ['1733 fish line 6 ']}
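
For what it's worth, the if/else above can probably be shortened. This is a small sketch of two equivalent ways to do the grouping, dict.setdefault and collections.defaultdict, run on a few hand-written tuples shaped like the findall result (the sample data is abbreviated, not the full list).

```python
from collections import defaultdict

# a few (sentence, animal) tuples shaped like the findall result
groups = [
    ("1234 dog list some words ", "dog"),
    ("1432 cat line 2 ", "cat"),
    ("1348 dog line 4 ", "dog"),
]

# setdefault returns the existing list, or inserts and returns a new empty one
dct = {}
for sentence, animal in groups:
    dct.setdefault(animal, []).append(sentence)

# defaultdict(list) creates the empty list on first access to a missing key
ddct = defaultdict(list)
for sentence, animal in groups:
    ddct[animal].append(sentence)

print(dct)
print(dict(ddct) == dct)
```
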

Finally, we just have to print it in the format you want:

# then print the result in the format you like
for key, value in dct.items():
    print(key)
    for sentence in value:
        print(sentence)
    print()

And the output is:

dog
1234 dog list some words 
1348 dog line 4 
1678 dog line 5 

cat
1432 cat line 2 
1789 cat line3 
1093 cat more words

fish
1733 fish line 6 

The final code is the following:

import re

# original string with no new line characters
txt = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""

# use findall to capture the groups
groups = re.findall(r"(?=(\d{4} (\w+) .*?(?=\d{4}|$)))", txt)

# create a dictionary to store the formatted data
dct = {}
for group in groups:
    if group[1] in dct:
        dct[group[1]].append(group[0])
    else:
        dct[group[1]] = [group[0]]

# then print the result in the format you like
for key, value in dct.items():
    print(key)
    for sentence in value:
        print(sentence)
    print()