
I want to extract 2 lists of words that are connected by the sign =. The regex code works for separate lists but not in combination.

Example string: bla word1="word2" blabla abc="xyz" bla bla

One output shall contain the words directly left of =, i.e. word1, abc and the other output shall contain the words directly right of =, i.e. word2, xyz without quotes.

\w+(?==\"(?:(?!\").)*\") extracts the words left of =, i.e. word1,abc

=\"(?:(?!\").)*\" extracts the words right of = including quotes and =, i.e. ="word2",="xyz"

How can I combine these 2 queries into a single regular expression that outputs 2 groups? Quotes and equal signs shall not be part of the output.

3 Answers


If you are looking for lhs and rhs from lhs="rhs", this should work (sorry, this is what I understood from your question):

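import re

test_str = 'abc="def" ghi'
# Raw string so that \w reaches the regex engine unchanged
ans = re.search(r'(\w+)="(\w+)"', test_str)
print(ans.group(1))  # left-hand side
print(ans.group(2))  # right-hand side
my_list = list(ans.groups())  # both groups of the first match
print(my_list)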
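
Note that re.search returns only the first match, while the question asks for two lists covering every pair. A minimal sketch of one way to get there, assuming the same \w+ pattern and the example string from the question (the names pairs, left and right are only illustrative):

```python
import re

text = 'bla word1="word2" blabla abc="xyz" bla bla'

# With two capturing groups, findall returns one (lhs, rhs) tuple per match
pairs = re.findall(r'(\w+)="(\w+)"', text)

# Split the tuples into two parallel lists
left = [lhs for lhs, _ in pairs]
right = [rhs for _, rhs in pairs]

print(left)   # ['word1', 'abc']
print(right)  # ['word2', 'xyz']
```

Both lists stay empty if nothing matches, so no None check is needed here.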
ecsridhar

You can use

([^\s=]+)="([^"]*)"

See the regex demo. Details:

  • ([^\s=]+) - Group 1: one or more occurrences of a char other than whitespace and = char
  • =" - a =" substring
  • ([^"]*) - Group 2: zero or more chars other than " char
  • " - a " char.

Note: \w+ only matches one or more letters, digits and underscores, and won't match if the keys contain, say, hyphens. The (?:(?!\").)* tempered greedy token is not efficient, and does not match line break chars. As the negative lookahead only contains a single char pattern (\"), it is more efficient to write it as a negated character class, [^"]*. The negated character class also matches line break chars. If you do not want that behavior, just add \r\n to the negated character class.
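
The difference described in the note can be checked with a short Python sketch. The input string below is a made-up example (a hyphenated key, a value containing a line break, and an empty value), not something from the question:

```python
import re

pattern = re.compile(r'([^\s=]+)="([^"]*)"')

# Hypothetical input: hyphenated key, value spanning a line break, empty value
text = 'data-id="42" title="two\nlines" empty=""'

keys = [m.group(1) for m in pattern.finditer(text)]
values = [m.group(2) for m in pattern.finditer(text)]

print(keys)    # ['data-id', 'title', 'empty']
print(values)  # ['42', 'two\nlines', '']

# For comparison, \w+ cannot cross the hyphen and captures only the tail of the key
word_keys = re.findall(r'(\w+)="', text)
print(word_keys)  # ['id', 'title', 'empty']
```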

Wiktor Stribiżew

This should do what you want:

(?: (\w*)=)(?:\"(\w*)\")

This is for a Python regex.

You can see it working here.
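
One property of this pattern worth checking before relying on it: it starts with a literal space, so a pair at the very beginning of the string is not matched. A small sketch, using the example string from the question and the test string from the first answer (the variable names are only illustrative):

```python
import re

pattern = re.compile(r'(?: (\w*)=)(?:\"(\w*)\")')

# Every pair is preceded by a space, so both are found
spaced = pattern.findall('bla word1="word2" blabla abc="xyz" bla bla')
print(spaced)  # [('word1', 'word2'), ('abc', 'xyz')]

# The pair sits at the start of the string, so the leading space cannot match
at_start = pattern.findall('abc="def" ghi')
print(at_start)  # []
```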

0xd34dc0de