Assume this string:

[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text

I'd like to end up with key/value pairs like this:

Key      Value
aaa      some text here  
bbbb3    some other text here  
cc       more text

or a pandas DataFrame like this:

aaa            | bbbb3                |cc
-------------------------------------------------
some text here | some other text here | more text
next line      | .....                | .....

I tried a regex like r'\[(.{6})\]\s(.*?)\s\[', but it doesn't work.

cs95
John Doe

6 Answers


Try this regex, which captures your key and value in named groups.

\[\s*(?P<key>\w+)+\s*]\s*(?P<value>[^[]*\s*)

Explanation:

  • \[ --> [ has a special meaning (it opens a character set), so it must be escaped to match a literal [
  • \s* --> Consumes any whitespace before the intended key, which doesn't need to be part of the key
  • (?P<key>\w+)+ --> Forms a named group, key, capturing one or more word characters [a-zA-Z0-9_]. I have used \w to keep it simple, as the OP's string only contains alphanumeric characters; otherwise one should use the [^]] character set to capture everything within the square brackets as the key. (The trailing + after the group is redundant, since \w+ already matches the whole run.)
  • \s* --> Consumes any whitespace after the intended key, which doesn't need to be part of the key
  • ] --> Matches a literal ], which doesn't need escaping
  • \s* --> Consumes any leading whitespace that doesn't need to be part of the value
  • (?P<value>[^[]*\s*) --> Forms a named group, value, capturing any character except [, so it stops capturing at the next opening bracket.

Python code:

import re
s = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

arr = re.findall(r'\[\s*(?P<key>\w+)+\s*]\s*(?P<value>[^[]*\s*)', s)
print(arr)

Output:

[('aaa', 'some text here '), ('bbbb3', 'some other text here '), ('cc', 'more text')]
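
Because findall returns plain tuples, the group names are not visible in the result above. A minimal sketch, assuming you want to read the groups by name and that keys may contain non-word characters: it swaps \w+ for the [^\]] character set mentioned in the explanation and uses re.finditer. The sample string is made up for illustration and is not from the question.

```python
import re

# Hypothetical input whose keys contain a space and a hyphen,
# so \w+ alone would not capture them.
s = '[first name] Ada [e-mail ] ada@example.org'

# [^\]]+? lazily takes everything up to the closing bracket,
# leaving any trailing padding to the \s* that follows it.
pattern = re.compile(r'\[\s*(?P<key>[^\]]+?)\s*\]\s*(?P<value>[^\[]*)')

# Read each match by group name and trim the value's trailing space.
pairs = {m.group('key'): m.group('value').strip() for m in pattern.finditer(s)}
print(pairs)
```

The lazy quantifier is a design choice here: a greedy [^\]]+ would also swallow the padding spaces inside the brackets, so the key would need a separate strip().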
Pushpesh Kumar Rajwanshi

Use re.findall, and extract regions of interest into columns. You can then strip out spaces as necessary.

Since you mentioned you are open to reading this into a DataFrame, you can leave that job to pandas.

import re
import pandas as pd

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'
matches = re.findall(r'\[(.*?)\](.*?)(?=\[|$)', text)

df = (pd.DataFrame(matches, columns=['Key', 'Value'])
        .apply(lambda x: x.str.strip()))

df
     Key                 Value
0    aaa        some text here
1  bbbb3  some other text here
2     cc             more text

Or (Re: edit),

df = (pd.DataFrame(matches, columns=['Key', 'Value'])
        .apply(lambda x: x.str.strip())
        .set_index('Key')
        .transpose())

Key               aaa                 bbbb3         cc
Value  some text here  some other text here  more text

The pattern matches the text inside square brackets, followed by the text outside them up to the next opening bracket.

\[      # Opening square bracket
(.*?)   # First capture group
\]      # Closing square bracket
(.*?)   # Second capture group
(?=     # Look-ahead
   \[   # Next opening bracket,
   |    # Or,
   $    # End of string
)
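
The commented layout above can be kept in real code as well. A small sketch, assuming the question's sample string, that compiles the same pattern with re.VERBOSE so the whitespace and # comments inside the pattern are ignored:

```python
import re

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# re.VERBOSE ignores whitespace and "#" comments inside the pattern.
pattern = re.compile(r"""
    \[        # opening square bracket
    (.*?)     # first capture group: the key
    \]        # closing square bracket
    (.*?)     # second capture group: the value
    (?=       # look-ahead, consumes nothing
       \[     # next opening bracket,
       |      # or
       $      # end of the string
    )
""", re.VERBOSE)

# Strip the padding that the lazy groups pick up around keys and values.
pairs = [(k.strip(), v.strip()) for k, v in pattern.findall(text)]
print(pairs)
```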
cs95
  • Many thanks, also for the regex explanation! Is it possible to use the key as column name? – John Doe Dec 19 '18 at 14:59
  • @JohnDoe Re:edit, `pd.DataFrame(matches, columns=['Key', 'Value']).apply(...).set_index('Key').T` – cs95 Dec 19 '18 at 15:11
  • This is not what I mean. I will try to explain it better. Now the columns are `Key` and `Value`, and I would like to see `column=['aaa', 'bbbb3','cc']` from the first captured group and the values from the second captured group. The key/value wording is confusing here. I only mentioned it to end up with a dictionary and potentially a DataFrame – John Doe Dec 19 '18 at 15:18
  • @JohnDoe Yes, figured it out. See my edit showing how to load it in that format. The columns are the keys once you transpose. – cs95 Dec 19 '18 at 15:21
  • Ok thanks. A bit confusing to me because I still saw the key value as indexes – John Doe Dec 19 '18 at 15:23
  • This doesn't work if I'm processing multiple lines. Only the last one is added to the DataFrame – John Doe Dec 19 '18 at 16:34
  • @JohnDoe Yes... because your question had to do with a single string only. Consider opening a new question as a follow up to this? – cs95 Dec 19 '18 at 16:42
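
For the multi-line case raised in the comments, one possible approach (a sketch, not part of the original answer) is to build one dict per line and collect them in a list. The second input line below is made up for illustration.

```python
import re

# Hypothetical multi-line input; the second line is invented for this sketch.
lines = [
    '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text',
    '[aaa   ] second row [bbbb3 ] another value [cc    ] last value',
]

pattern = re.compile(r'\[(.*?)\](.*?)(?=\[|$)')

# One dict per input line: the keys become field names, the values the cells.
records = [
    {key.strip(): value.strip() for key, value in pattern.findall(line)}
    for line in lines
]
print(records)
```

Passing records to pd.DataFrame(records) should then give one row per input line, with aaa, bbbb3 and cc as the column names.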

You could minimize the regex needed by using re.split() and output to a dictionary. For example:

import re

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# split text on "[" or "]" and slice off the first empty list item
items = re.split(r'[\[\]]', text)[1:]

# loop over consecutive pairs in the list to create a dict
d = {items[i].strip(): items[i+1].strip() for i in range(0, len(items) - 1, 2)}

print(d)
# {'aaa': 'some text here', 'bbbb3': 'some other text here', 'cc': 'more text'}
benvc
  • See the better `str.split()` based approach in the answer from @PatrickArtner (no regex needed). – benvc Dec 19 '18 at 15:02

Regex is not really needed here; a simple string split does the job:

s = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"    

parts = s.split("[")  # parts looks like: ['', 
                      #                    'aaa   ] some text here ',
                      #                    'bbbb3 ] some other text here ', 
                      #                    'cc    ] more text'] 
d = {}
# split parts further
for p in parts:
    if p.strip():
        key,value = p.split("]")            # split each part at ] and strip spaces
        d[key.strip()] = value.strip()      # put into dict

# Output:
form = "{:10} {}"
print( form.format("Key","Value"))

for i in d.items():
      print(form.format(*i))

Output (on Python 3.7+, where dicts keep insertion order):

Key        Value
aaa        some text here
bbbb3      some other text here
cc         more text

For the formatting used above, see the "Format String Syntax" section of the Python documentation.

As an almost one-liner:

d = {hh[0].strip():hh[1].strip() for hh in (k.split("]") for k in s.split("[") if k)}  
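
One caveat with unpacking p.split("]") into two names: it raises a ValueError if a value itself contains a ]. A hedged variant, assuming such values can occur, uses str.partition, which only splits at the first ]. The function name parse_pairs is just an illustrative choice.

```python
def parse_pairs(s):
    """Parse '[key] value [key] value ...' into a dict using only str methods."""
    pairs = {}
    for part in s.split("["):
        if not part.strip():
            continue  # skip the empty piece before the first "["
        # partition splits at the first "]" only, so a later "]" stays in the value
        key, _, value = part.partition("]")
        pairs[key.strip()] = value.strip()
    return pairs


s = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"
print(parse_pairs(s))
print(parse_pairs("[k ] a ] b"))
```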
Patrick Artner

You could use finditer:

import re

s = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

pattern = re.compile(r'\[(\S+?)\s+\]([\s\w]+)')  # raw string avoids invalid-escape warnings
result = [(match.group(1).strip(), match.group(2).strip()) for match in pattern.finditer(s)]
print(result)

Output

[('aaa', 'some text here'), ('bbbb3', 'some other text here'), ('cc', 'more text')]
Dani Mesejo

With a RegEx, you can find the key/value pairs, store them in a dictionary, and print them out:

import re

mystr = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"

a = dict(re.findall(r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+(?=\[|$))", mystr))

for key, value in a.items():
    print(key, value)

# OUTPUT (Python 3.7+, where dicts keep insertion order):
# aaa     some text here 
# bbbb3   some other text here 
# cc      more text

The RegEx captures two groups:
The first group is the run of letters, digits, underscores and whitespace enclosed in square brackets. The second is the run of the same characters that follows the closing square bracket and is itself followed by an opening square bracket or the end of the string.

First group: \[([A-Za-z0-9_\s]+)\]
Second group: ([A-Za-z0-9_\s]+(?=\[|$))

Note that in the second group we have a positive lookahead: (?=\[|$). Without the positive lookahead, the [ would be consumed, and the next match would not find its opening square bracket.
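
To see the effect of the lookahead, here is a small comparison on the same string. The second pattern is an illustrative variant, not part of the answer: it replaces the lookahead with a non-capturing group that consumes the next [.

```python
import re

mystr = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"

# With the lookahead: the next "[" is only checked, not consumed.
with_lookahead = re.findall(
    r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+(?=\[|$))", mystr)

# Illustrative variant: a non-capturing group consumes the next "[" instead,
# so the pair that starts at that bracket can no longer be matched.
without_lookahead = re.findall(
    r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+)(?:\[|$)", mystr)

print(len(with_lookahead))     # 3
print(len(without_lookahead))  # 2
print([key.strip() for key, _ in without_lookahead])  # ['aaa', 'cc']
```

The bbbb3 pair is lost in the second case because its opening bracket was already eaten by the first match.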

findall then returns a list of tuples: [(key1,value1), (key2,value2), (key3,value3),...].
A list of two-item tuples can be converted directly into a dictionary: dict(my_tuple_list).

Once you have your dict, you can do what you want with your key/value pairs :)

Gsk