
I am having trouble understanding the regular expressions module in Python. I think what I am trying to do is fairly simple, but I cannot figure it out.

I need to search through some xml files and find this pattern:

'DisplayName="Parcels (10-1-2012)"'

I can parse the XML and make replacements with no problem. The part I cannot figure out is how to do a wildcard search to find any instance of "Parcels (some-date-year)". Since the date will vary, I need to find this pattern:

pat = '"Parcels (*-*-*)"' 

and I want to replace it with today's date which I can do with the time module. I copied out a line of one of the 80 or so xml docs where I would need to find the pattern.

According to the help for the re.search() function, it seems I can just put in a pattern, then the string I wish to search through. However, I am getting errors.

Help on function search in module re:

search(pattern, string, flags=0) Scan through string looking for a match to the pattern, returning a match object, or None if no match was found.

Here is my little test snippet:

import re
pat = '"Parcels (*-*-*)"'
t= '         <Layer DisplayName="Parcels (7-1-2010)" FeatureDescription="Owner Name: {OWNER_NAME}&lt;br/&gt;Property Address: {PROP_ADDR}&lt;br/&gt;Tax Name: {TAX_NAME}&lt;br/&gt;Tax Address 1: {TAX_ADD_L1}&lt;br/&gt;Tax Address 2: {TAX_ADD_L2}&lt;br/&gt;Land Use: {USE1_DESC}&lt;br/&gt;&lt;a href=&quot;http://www16.co.hennepin.mn.us/pins/pidresult.jsp?pid={PID_NO}&quot;&gt;View Property Information&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;" FeatureLabel="Parcel ID: {PID_NO}" IconUri="{RestVirtualDirectoryUrl}/Images/Parcel.png" Identifiable="true" IncludeInLayerList="true" IncludeInLegend="true" Name="Parcels" Searchable="true" ShowMapTips="true" UnconfiguredFieldsSearchable="true" UnconfiguredFieldsVisible="true" Visible="true">'
match = re.search(pat, t)
print match

Most of the line is junk I don't need to worry about. I just need to see how I can find that date in the line so I can use just that piece in the replace() function. Does anyone know how I could find these dates? There may be other dates in the xml somewhere, but I don't need to replace these; just where it says "Parcels (some-date-year)". I appreciate any help! Thanks!

crmackey

2 Answers

import re

t= '         <Layer DisplayName="Parcels (7-1-2010)" FeatureDescription="Owner Name: {OWNER_NAME}&lt;br/&gt;Property Address: {PROP_ADDR}&lt;br/&gt;Tax Name: {TAX_NAME}&lt;br/&gt;Tax Address 1: {TAX_ADD_L1}&lt;br/&gt;Tax Address 2: {TAX_ADD_L2}&lt;br/&gt;Land Use: {USE1_DESC}&lt;br/&gt;&lt;a href=&quot;http://www16.co.hennepin.mn.us/pins/pidresult.jsp?pid={PID_NO}&quot;&gt;View Property Information&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;" FeatureLabel="Parcel ID: {PID_NO}" IconUri="{RestVirtualDirectoryUrl}/Images/Parcel.png" Identifiable="true" IncludeInLayerList="true" IncludeInLegend="true" Name="Parcels" Searchable="true" ShowMapTips="true" UnconfiguredFieldsSearchable="true" UnconfiguredFieldsVisible="true" Visible="true">'

You need to escape the parentheses, because unescaped they start a regex group. After that you can be as specific as you like about the contents. The generic "any character" symbol is ., and * means zero or more of the preceding item:

pat = r'"Parcels \(.*\)"'  # raw string, so the backslashes reach the regex engine unchanged
match = re.search(pat, t)
print(match.group())

Which prints:

"Parcels (7-1-2010)"
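To see why the original pattern fails, here is a minimal sketch. The unescaped "(" opens a group and the "*" right after it has nothing to repeat, so the pattern does not even compile. The shortened sample line and the variable names are only for illustration; re.escape is one way to avoid escaping the literal parts by hand.

```python
import re

# The pattern from the question: "(" opens a group and "*" has nothing to repeat.
broken = '"Parcels (*-*-*)"'
try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)  # the regex engine's complaint about the pattern

# re.escape() backslash-escapes regex metacharacters in literal text,
# so only the date part has to be written as a real pattern.
pattern = re.escape('"Parcels (') + r'\d+-\d+-\d+' + re.escape(')"')

# Shortened stand-in for the XML line in the question.
sample = '<Layer DisplayName="Parcels (7-1-2010)" Name="Parcels">'
found = re.search(pattern, sample)
print(found.group())
```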

A more specific pattern would be:

pat = r'"Parcels \([0-9]+-[0-9]+-[0-9]+\)"'
match = re.search(pat, t)
print(match.group())

Which prints:

"Parcels (7-1-2010)"

Here, the character class [0-9] matches any single digit from 0 to 9 (\d is a near-equivalent shorthand; in Python 3 it also matches non-ASCII Unicode digits). The plus, +, following it means one or more, and the dash matches a literal dash.
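Since the end goal is to swap in today's date, the match and the replacement can be done in one step with re.sub. This is only a sketch: the function name update_parcels_date, the shortened sample line, and the month-day-year format without leading zeros are assumptions based on the example in the question.

```python
import re
from datetime import date

# Matches the quoted label, e.g. "Parcels (7-1-2010)".
PARCELS_PATTERN = re.compile(r'"Parcels \(\d+-\d+-\d+\)"')


def update_parcels_date(line, today=None):
    """Return line with every "Parcels (m-d-yyyy)" label set to the given date."""
    if today is None:
        today = date.today()
    label = '"Parcels ({}-{}-{})"'.format(today.month, today.day, today.year)
    # A function replacement is inserted as-is, with no backslash processing.
    return PARCELS_PATTERN.sub(lambda m: label, line)


# Shortened stand-in for the XML line in the question.
line = '<Layer DisplayName="Parcels (7-1-2010)" Name="Parcels" Visible="true">'
print(update_parcels_date(line, date(2014, 1, 30)))
# <Layer DisplayName="Parcels (1-30-2014)" Name="Parcels" Visible="true">
```

Other dates in the file are left alone, because the pattern only matches when the quoted "Parcels (" prefix is present.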

Russia Must Remove Putin
  • Thanks Aaron! The re module's help is a little confusing. I need to do some more reading there. That did the trick though. – crmackey Jan 30 '14 at 16:27
  • Thanks for the second option as well. This makes more sense to me than the first, and would probably be better at flagging the numeric characters. Thanks again! – crmackey Jan 30 '14 at 16:29
  • I created a quick ref card based on the help with a small demo at the end. I should probably publish it. It basically lists the special characters, special sequences, module functions, flags, help on functions, (fairly complete descriptions) and a small Verbose example at the end of my own. – Russia Must Remove Putin Jan 30 '14 at 16:30
  • Please do! I would like to see it; it would probably be easier to understand than the online help. Please post the location if you choose to publish. – crmackey Jan 30 '14 at 17:00

Aaron's answer is good. Here is a small modification that matches the data format you specified (three parts separated by dashes):

import re

the_string = '<Layer DisplayName="Parcels (7-1-2010)" ... blablabla '
pattern = r'Parcels \(.*-.*-.*\)'
match = re.search(pattern, the_string)
print(match.group())

Also, if you suspect the string may contain more than one match, you can collect all of them with re.findall, which returns a list of the matched strings. Here I've also used \d+, which matches one or more digits, in place of .* so that only numeric dates are accepted:

import re

the_string = '<Layer DisplayName="Parcels (7-1-2011)" ... blablabla ... Layer DisplayName="Parcels (7-1-2012)" '
pattern = r'Parcels \(\d+-\d+-\d+\)'
all_matches = re.findall(pattern, the_string)
for match in all_matches:
    print(match)
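If the individual parts of each date are needed, capturing groups can pull them out. The sketch below uses re.finditer with three groups; the sample string and variable names are made up for illustration.

```python
import re

# Illustrative string containing two Parcels labels.
the_string = 'DisplayName="Parcels (7-1-2011)" ... DisplayName="Parcels (7-1-2012)"'

# Each (\d+) is a capturing group: month, day, year.
pattern = re.compile(r'Parcels \((\d+)-(\d+)-(\d+)\)')

# finditer yields one match object per occurrence; groups() returns the captures.
dates = [tuple(int(part) for part in m.groups()) for m in pattern.finditer(the_string)]
print(dates)  # [(7, 1, 2011), (7, 1, 2012)]
```

Note that once a pattern contains groups, re.findall returns tuples of the captured groups rather than the whole matched text.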
Quentin Donnellan