
Let's say I have this string

[LEVEL]
    [NAME]The Girder Guide! [/NAME]
    [AUTHOR]draworigami[/AUTHOR]
    [AUTHORLEVEL]11[/AUTHORLEVEL]
    [COUNTRY]CA[/COUNTRY]
    [ID]62784[/ID]
    [RATING]4[/RATING]
    [DATE]2021-05-11 23:08:35[/DATE]
    [PLAYCOUNT]33[/PLAYCOUNT]
    [WINCOUNT]28[/WINCOUNT]
    [STARS]0[/STARS]
    [COMMENTS]1[/COMMENTS]
[/LEVEL]

Is there a way to get the individual strings between each opening tag like [NAME] and its closing tag like [/NAME]? I've tried several snippets from the internet, to no avail.

Ashok Arora
Snackers
  • Welcome to Stack Overflow! Please take the [tour](http://stackoverflow.com/tour), read up on [how to ask a question](https://stackoverflow.com/help/asking) and provide the [shortest program necessary to reproduce the problem](https://stackoverflow.com/help/minimal-reproducible-example). Why the `rml` tag? – hiro protagonist May 14 '21 at 12:11
  • This looks like an XML-like recursive language, so you could parse it with a recursive-descent, LL(k) or LR(k) parser. Regexes won't work because they aren't powerful enough for this kind of language. – ForceBru May 14 '21 at 12:15
  • @hiroprotagonist It is in RDF Mapping Language (RML) formatting. – Snackers May 14 '21 at 12:28
  • [RML Mapping Language](https://rml.io/docs/rml/introduction/) looks different from what you included in your question. In the spec the square brackets are used to wrap recursive content, while in your format they are used to identify tags. Seems completely different. Please provide a reference for the format you are using, including specs on how certain characters are escaped. – trincot May 14 '21 at 12:32
  • @Snackers it really does not look like [this rml](https://rml.io/docs/rml/introduction/)... – hiro protagonist May 14 '21 at 12:38
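For reference, the nesting-aware parsing suggested in the comments can be sketched with a small stack-based parser. This is only a sketch under assumptions the question does not confirm: tag names consist of word characters, text never contains square brackets, and a repeated tag name under the same parent overwrites the earlier one. The name `parse_bracket_markup` is hypothetical.

```python
import re

# One token is either an opening/closing tag or a run of plain text.
TOKEN = re.compile(r"\[(/?)(\w+)\]|([^\[\]]+)")

def parse_bracket_markup(text):
    """Parse "[TAG]...[/TAG]" markup into nested dicts (hypothetical helper)."""
    root = {}
    # Each stack frame: (tag name, child elements, collected text pieces).
    stack = [("", root, [])]
    for match in TOKEN.finditer(text):
        closing, name, chunk = match.groups()
        if chunk is not None:
            # Plain text belongs to the innermost open tag.
            stack[-1][2].append(chunk)
        elif not closing:
            # Opening tag: start a new frame.
            stack.append((name, {}, []))
        else:
            # Closing tag: it must match the innermost open tag.
            open_name, children, pieces = stack.pop()
            if open_name != name:
                raise ValueError(f"expected [/{open_name}], got [/{name}]")
            # A tag with children becomes a dict, a leaf becomes its text.
            value = children if children else "".join(pieces).strip()
            stack[-1][1][name] = value
    if len(stack) != 1:
        raise ValueError("unclosed tag: " + stack[-1][0])
    return root
```

Calling `parse_bracket_markup(rml)["LEVEL"]["AUTHOR"]` on the string from the question would then give the author name, and mismatched or unclosed tags raise a `ValueError` instead of being silently accepted.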

3 Answers


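Converting the square brackets to angle brackets lets you parse the string as HTML with BeautifulSoup. This returns all the text between the opening and closing tags: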

from bs4 import BeautifulSoup

rml = """
[LEVEL]
    [NAME]The Girder Guide! [/NAME]
    [AUTHOR]draworigami[/AUTHOR]
    [AUTHORLEVEL]11[/AUTHORLEVEL]
    [COUNTRY]CA[/COUNTRY]
    [ID]62784[/ID]
    [RATING]4[/RATING]
    [DATE]2021-05-11 23:08:35[/DATE]
    [PLAYCOUNT]33[/PLAYCOUNT]
    [WINCOUNT]28[/WINCOUNT]
    [STARS]0[/STARS]
    [COMMENTS]1[/COMMENTS]
[/LEVEL]
"""

html = rml.replace('[', '<').replace(']', '>')
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('level').text)

Output:

The Girder Guide! 
draworigami
11
CA
62784
4
2021-05-11 23:08:35
33
28
0
1

Edit #1: The original string does not contain newlines, so to print each value on its own line, iterate over the child elements:

rml = "[LEVEL][NAME]The Girder Guide![/NAME][AUTHOR]draworigami[/AUTHOR][AUTHORLEVEL]11[/AUTHORLEVEL][COUNTRY]CA[/COUNTRY][ID]62784[/ID][RATING]4[/RATING][DATE]2021-05-11 23:08:35[/DATE][PLAYCOUNT]33[/PLAYCOUNT][WINCOUNT]28[/WINCOUNT][STARS]0[/STARS][COMMENTS]1[/COMMENTS][/LEVEL]"

html = rml.replace('[', '<').replace(']', '>')
soup = BeautifulSoup(html, 'html.parser')
elements = soup.find('level').contents
for e in elements:
    print(e.text)
Ashok Arora

Try this:

st = "[LEVEL][NAME]The Girder Guide![/NAME][AUTHOR]draworigami[/AUTHOR][AUTHORLEVEL]11[/AUTHORLEVEL][COUNTRY]CA[/COUNTRY][ID]62784[/ID][RATING]4[/RATING][DATE]2021-05-11 23:08:35[/DATE][PLAYCOUNT]33[/PLAYCOUNT][WINCOUNT]28[/WINCOUNT][STARS]0[/STARS][COMMENTS]1[/COMMENTS][/LEVEL]"

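# Split on "]" so every tag and value ends up in its own list item.
st = st.split("]")
# Strip the remaining "[" and "/" characters from each item.
st = [item.replace("[", "").replace("/", "") for item in st]

# The split leaves an empty string after the final "]"; drop it.
st = st[:-1]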

print(st)

st becomes:

['LEVEL', 'NAME', 'The Girder Guide!NAME', 'AUTHOR', 'draworigamiAUTHOR', 'AUTHORLEVEL', '11AUTHORLEVEL', 'COUNTRY', 'CACOUNTRY', 'ID', '62784ID', 'RATING', '4RATING', 'DATE', '2021-05-11 23:08:35DATE', 'PLAYCOUNT', '33PLAYCOUNT', 'WINCOUNT', '28WINCOUNT', 'STARS', '0STARS', 'COMMENTS', '1COMMENTS', 'LEVEL']

What I did:

  • Split the string on ], which gives a list of strings that no longer contain the character ']'.
  • Removed the characters [ and / from each string in that list.
  • Dropped the last element, because it was an empty string produced by the split.
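In the list above each value is still joined to its closing tag name (for example 'The Girder Guide!NAME'). A possible variation on the same split idea, not part of the original answer, is to split on whole tags with `re.split` so that only the values remain. It assumes tag names are made of word characters and that values never contain square brackets; the shortened `st` is just for illustration.

```python
import re

st = "[LEVEL][NAME]The Girder Guide![/NAME][AUTHOR]draworigami[/AUTHOR][ID]62784[/ID][/LEVEL]"

# Split on complete opening or closing tags instead of on "]" alone.
parts = re.split(r"\[/?\w+\]", st)

# Adjacent tags produce empty strings; drop them and whitespace-only pieces.
values = [p.strip() for p in parts if p.strip()]

print(values)  # ['The Girder Guide!', 'draworigami', '62784']
```

Unlike removing every "/" character, this leaves slashes inside the values untouched.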
edusanketdk

How about using a regular expression?

import re
s = '[LEVEL][NAME]The Girder Guide![/NAME][AUTHOR]draworigami[/AUTHOR][AUTHORLEVEL]11[/AUTHORLEVEL][COUNTRY]CA[/COUNTRY][ID]62784[/ID][RATING]4[/RATING][DATE]2021-05-11 23:08:35[/DATE][PLAYCOUNT]33[/PLAYCOUNT][WINCOUNT]28[/WINCOUNT][STARS]0[/STARS][COMMENTS]1[/COMMENTS][/LEVEL]'
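# Drop every "/" so closing tags look like opening tags
# (this also removes any "/" inside the values).
s = s.replace('/', '')
result = []
# Each value now sits between a "]" and the next "[".
for e in re.findall(r"\][A-Za-z0-9 _.:,!'/$\-]+\[", s):
    result.append(e.replace('[', '').replace(']', ''))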

result

['The Girder Guide!',
 'draworigami',
 '11',
 'CA',
 '62784',
 '4',
 '2021-05-11 23:08:35',
 '33',
 '28',
 '0',
 '1']
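If the tag names are needed as well, one option is a pattern with a backreference that pairs each opening tag with its own closing tag. This is a sketch rather than part of the answer above, and it assumes tag names are word characters, values contain no square brackets, and only leaf tags (those with no nested tags) are wanted. The names `LEAF` and `fields` and the shortened `s` are illustrative.

```python
import re

s = "[LEVEL][NAME]The Girder Guide![/NAME][DATE]2021-05-11 23:08:35[/DATE][STARS]0[/STARS][/LEVEL]"

# \1 must repeat the opening tag name; [^\[\]]* keeps the match inside one leaf tag.
LEAF = re.compile(r"\[(\w+)\]([^\[\]]*)\[/\1\]")

# findall returns (tag, value) pairs because the pattern has two groups.
fields = {tag: value.strip() for tag, value in LEAF.findall(s)}

print(fields)
# {'NAME': 'The Girder Guide!', 'DATE': '2021-05-11 23:08:35', 'STARS': '0'}
```

The outer [LEVEL] tag is skipped because its content contains other tags, so the dictionary holds only the individual fields.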