I want to parse three types of lines from a file in Python:

"Name" "Something to say !"
"Just a descriptive sentence"
name "Something to say !"

I want to get the name and the sentence, or just the sentence if there is no name. I read each line of the file and use re to check whether the regex matches. It works pretty well except for this one:

"Name" "Something to say !"

It just returns the whole thing instead of two parts.

Here is my regex:

r"(\"[a-zA-z?]*\"|[a-zA-z]*)\s\"(.+)\""
louisld

2 Answers

You might use a capture group for " with a backreference to either match or not match the accompanying double quote.

Then you can make the whole first part including the whitespace char optional, and match the second part between double quotes.

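Note that [a-zA-z] matches more than [a-zA-Z]: the range A-z also covers the ASCII characters between Z and a, which are [ \ ] ^ _ and the backtick. The ? inside the character class matches a question mark literally; it is not a quantifier there.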
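
The difference between the two ranges can be checked directly. This is a minimal sketch; the sample string is made up for illustration and is not from the question:

```python
import re

# Characters that sit between "Z" and "a" in ASCII, plus a question mark.
sample = "[\\]^_`?"

# A-z spans the ASCII range from "A" to "z", which includes [ \ ] ^ _ `
print(re.findall(r"[a-zA-z]", sample))   # ['[', '\\', ']', '^', '_', '`']

# A-Z and a-z only cover the letters, so nothing matches here
print(re.findall(r"[a-zA-Z]", sample))   # []

# Inside a character class, ? is a literal question mark
print(re.findall(r"[a-zA-Z?]", sample))  # ['?']
```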

The matches are in group 1 and group 3.

(?:(("?)[a-zA-Z]+\2)\s)?("[^"]+")
  • (?: Non capture group
    • ( Capture group 1
      • ("?) Capture an optional " in group 2
      • [a-zA-Z]+ Match 1 or more characters in the range a-z or A-Z
      • \2 A backreference to group 2 to match exactly what is matched in that group
    • )\s Close group 1 and match a whitespace char
  • )? Close the non capture group and make it optional
  • ("[^"]+") Capture group 3, match from " till "


Example using re.finditer looping the matches:

import re

regex = r"(?:((\"?)[a-zA-Z]+\2)\s)?(\"[^\"]+\")"
s = ("\"Name\" \"Something to say !\"\n"
     "\"Just a descriptive sentence\"\n"
     "name \"Something to say !\"\n"
     "\"Name\" \"Something to say !\"")

# Each match carries the optional name in group 1 and the sentence in group 3
for match in re.finditer(regex, s):
    print(f"Name: {match.group(1)} Sentence: {match.group(3)}")

Output

Name: "Name" Sentence: "Something to say !"
Name: None Sentence: "Just a descriptive sentence"
Name: name Sentence: "Something to say !"
Name: "Name" Sentence: "Something to say !"
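
Since the question reads the file line by line, the same pattern could also be applied per line. The following is only a sketch under that assumption: parse_line is a hypothetical helper, not part of the answer, and stripping the quotes from the result is an added convenience.

```python
import re

# Same pattern as in the answer, written in single quotes to avoid escaping.
PATTERN = re.compile(r'(?:(("?)[a-zA-Z]+\2)\s)?("[^"]+")')

def parse_line(line):
    """Return (name, sentence) without quotes; name is None if absent.

    parse_line is a hypothetical helper, not part of the original answer.
    """
    m = PATTERN.fullmatch(line.strip())
    if m is None:
        return None  # the line has none of the three expected shapes
    name, sentence = m.group(1), m.group(3)
    if name is not None:
        name = name.strip('"')
    return name, sentence.strip('"')

print(parse_line('"Name" "Something to say !"'))    # ('Name', 'Something to say !')
print(parse_line('"Just a descriptive sentence"'))  # (None, 'Just a descriptive sentence')
print(parse_line('name "Something to say !"'))      # ('name', 'Something to say !')
```

Using fullmatch instead of finditer means a malformed line gives None rather than a partial match.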
The fourth bird

Solution

Your best option in my view is to use named capture groups. Here's how:

import re

lines = [
    '"Name" "Something to say !"',
    '"Just a descriptive sentence"',
    'name "Something to say !"'
    ]

p = re.compile(r"(\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\")")

for line in lines:
    m = p.search(line)
    print(m["part1"])
    print(m["part2"])

The output will be

Name
Something to say !
None
Just a descriptive sentence
name
Something to say !

Explanation

The regex (\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\") consists of two main parts. I'll go through the first one, (\"?(?P<part1>.+?)\"? )?. The second one is very similar.

  • An outer group (...)? with the "zero or one" quantifier ?, which makes the whole group optional. So in your second case, only the 'part2' capturing group will be active.
  • Inside this group, the quotes are also marked with the "zero or one" quantifier to cover your third case: \"?
  • The part (?P<part1>.+?) matches the text between the quotes and assigns the name "part1" for easy access.
    • . matches all symbols
    • +? matches one or more of the previous lazily (as many characters as needed, as few as feasible). This is needed to exclude the second quote from the match.

With this regex, you can access the content of the named capturing groups via square-bracket syntax, as shown in the code above.
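
As a small sketch of the access styles: subscripting a match object works on Python 3.6 and later, while group() and groupdict() are the longer-standing equivalents. The sample lines are the ones from the question.

```python
import re

p = re.compile(r"(\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\")")
m = p.search('name "Something to say !"')

print(m["part1"])        # name  (subscript access needs Python 3.6+)
print(m.group("part2"))  # Something to say !
print(m.groupdict())     # {'part1': 'name', 'part2': 'Something to say !'}

# A named group that did not take part in the match yields None.
m_no_name = p.search('"Just a descriptive sentence"')
print(m_no_name["part1"])  # None
```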

Capturing the quotes

If you want to capture not only the text in quotes, but also the quotes themselves, move the \" inside the named capturing groups and keep the closing quote of part1 optional so that an unquoted name still matches: ((?P<part1>\"?.+?\"?) )?(?P<part2>\".+\")
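
A quote-keeping variant can be tried on the three sample lines as follows. This is a sketch rather than a definitive pattern: it assumes the closing quote of part1 is left optional, so that the unquoted-name line still yields a name.

```python
import re

# Variant that keeps the quotes; the closing quote of part1 is optional
# so that the unquoted-name line still yields a name.
p = re.compile(r'((?P<part1>"?.+?"?) )?(?P<part2>".+")')

lines = [
    '"Name" "Something to say !"',
    '"Just a descriptive sentence"',
    'name "Something to say !"',
]

results = []
for line in lines:
    m = p.search(line)
    results.append((m["part1"], m["part2"]))
    print(m["part1"], "|", m["part2"])
```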

jobrachem
  • Thank you, it works well. I just have another small question. My lines start with 4 spaces, and I get those spaces in part1. How can I exclude them? – louisld Apr 29 '21 at 18:13
  • I got it. The best regex I found for my need is: `r"(\"?(?P<part1>[A-Za-z0-9?]+?)\"? )?(\"(?P<part2>.+)\")"` – louisld Apr 29 '21 at 18:25
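
The leading-space case from the comments can be checked with a short sketch. The group names part1 and part2 are assumed to be the same as in the answer, and the strip() alternative is an assumption of this sketch, not something proposed in the thread.

```python
import re

# Restricting part1 to [A-Za-z0-9?] keeps leading spaces out of the name,
# because search() skips ahead until the pattern can start matching.
p = re.compile(r"(\"?(?P<part1>[A-Za-z0-9?]+?)\"? )?(\"(?P<part2>.+)\")")
indented = '    name "Something to say !"'

m = p.search(indented)
print(repr(m["part1"]))  # 'name'

# Alternative (an assumption, not from the thread): keep the answer's
# original pattern and strip the surrounding whitespace first.
p_original = re.compile(r"(\"?(?P<part1>.+?)\"? )?(\"(?P<part2>.+)\")")
m_stripped = p_original.search(indented.strip())
print(repr(m_stripped["part1"]))  # 'name'
```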