
I have a pattern that I want to use to find all matches in a string. However, findall() returns only the last match.

The string which I want to process is below:

'<inventor sequence="001" designation="us-only"><addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="002" designation="us-only"><addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="003" designation="us-only"><addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="004" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor><inventor sequence="005" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor>'

I am trying to use the following code to extract all inventors from the string.

INVENTORS_CONTENT_PATTERN = re.compile('<inventor sequence=".*" designation=".*">(.*?)</inventor>')

re.findall(INVENTORS_CONTENT_PATTERN, data)

The result contains only the last inventor, not all the inventors in data:

['<addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>']
martineau
Barney_su

1 Answer


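In your pattern, each .* inside the attribute values is greedy and can run past quotes and tags, so the first sequence=".*" stretches all the way to the last " designation=" in the string. The whole string then becomes a single match, and the capture group holds only the last inventor. Restricting the attribute values to [^"]* (any characters except a double quote) keeps each match inside one tag, so this expression might be closer to what you have in mind: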

<inventor sequence="[^"]*" designation="[^"]*">(.*?)<\/inventor>

Test

import re

regex = r'<inventor sequence="[^"]*" designation="[^"]*">(.*?)<\/inventor>'
test_str = """
<inventor sequence="001" designation="us-only"><addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="002" designation="us-only"><addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="003" designation="us-only"><addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="004" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor><inventor sequence="005" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor>

"""
print(re.findall(regex, test_str))

Output

['<addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>', '<addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>']
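To see the difference in isolation, here is a minimal sketch that runs both patterns side by side. The two-inventor `data` string is a shortened, made-up sample rather than the full input, and the names `greedy` and `bounded` are only illustrative.

```python
import re

# Shortened, hypothetical sample: two inventors on a single line.
data = (
    '<inventor sequence="001" designation="us-only"><last-name>Li</last-name></inventor>'
    '<inventor sequence="002" designation="us-only"><last-name>Liu</last-name></inventor>'
)

# Original pattern: the greedy .* in the attributes can cross tag boundaries.
greedy = re.compile(r'<inventor sequence=".*" designation=".*">(.*?)</inventor>')

# Adjusted pattern: [^"]* cannot cross a double quote, so it stays inside one attribute.
bounded = re.compile(r'<inventor sequence="[^"]*" designation="[^"]*">(.*?)</inventor>')

greedy_matches = greedy.findall(data)
bounded_matches = bounded.findall(data)

print(greedy_matches)   # ['<last-name>Liu</last-name>']
print(bounded_matches)  # ['<last-name>Li</last-name>', '<last-name>Liu</last-name>']
```

With the greedy pattern the whole line is consumed as one match, so findall() has nothing left to find after it.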

If you wish to explore, simplify, or modify the expression, regex101.com explains each part of it in its top-right panel and shows how it matches against sample inputs.
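If the input is well-formed XML, an XML parser may be an alternative to a regex here. The following is only a sketch under that assumption: the wrapping `<inventors>` root is hypothetical (added because the fragment has several top-level elements), and the sample `data` is shortened from the question.

```python
import xml.etree.ElementTree as ET

# Shortened sample taken from the question's data (two of the five inventors).
data = (
    '<inventor sequence="001" designation="us-only"><addressbook>'
    '<last-name>Li</last-name><first-name>Shuo</first-name>'
    '</addressbook></inventor>'
    '<inventor sequence="004" designation="us-only"><addressbook>'
    '<last-name>Wang</last-name><first-name>Hua</first-name>'
    '</addressbook></inventor>'
)

# Wrap the fragment in a hypothetical root element so it parses as one document.
root = ET.fromstring('<inventors>' + data + '</inventors>')

# Collect (sequence, first name, last name) for every inventor element.
names = [
    (
        inv.get('sequence'),
        inv.findtext('addressbook/first-name'),
        inv.findtext('addressbook/last-name'),
    )
    for inv in root.iter('inventor')
]

print(names)  # [('001', 'Shuo', 'Li'), ('004', 'Hua', 'Wang')]
```

This avoids depending on attribute order or spacing inside the tags, at the cost of requiring input that parses cleanly.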


Emma