I'm trying to solve a problem that I know I can solve by iterating through the string, but with Python I'm sure there is a regular expression that would solve it more elegantly. Resorting to an iterative process feels like giving up!

Basically, I have a list of properties in a single cell, and I need to work out which properties are subproperties and which are sub-subproperties, and match each one to the property it sits under. For example:

ID=11669 Antam Laterite Nickel/Ferronickel Operation  
     ID=19807 Gebe Laterite Nickel Mine
     ID=19808 Gee Island Laterite Nickel Mine
     ID=18923 Mornopo Laterite Nickel Mine
     ID=29411 Pomalaa Ferronickel Smelter
     ID=19806 Pomalaa Laterite Nickel Mine
          ID=29412 Maniang Laterite Nickel Project
     ID=11665 Southeast Sulawesi Laterite Nickel Project
          ID=27877 Bahubulu Laterite Nickel Deposit

Should generate:

MasterProp,    SubProp
11669,          19807
11669,          19808
11669,          18923
11669,          29411
11669,          19806
19806,          29412
11669,          11665
11665,          27877

Getting 11669 and the second level is easy: just grab the first ID I find and pair it with all the rest. Getting the third level is a lot harder.

I tried the following:

import re

# raw string so the backslash in \d reaches the regex engine unchanged
tags = re.compile(r'ID=(\d+).+(&nbsp;){8}')
for tag, space in tags.findall(str(cell)):
    print(tag)

But that gives me the first ID that precedes 8 spaces rather than the last ID before them, so in the example above I get 11669 rather than 19806. I suspect there is an expression that says "find an ID=(\d+) where there is no other ID=(\d+) between it and the 8 spaces", but that has proven beyond my (novice) capabilities! Any help would be welcome.
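The pattern described in words above ("an ID with no other ID between it and the 8 spaces") can be sketched with a negative lookahead. This is only an illustration: the `cell` string below is a hypothetical, shortened stand-in for the real cell contents, and it assumes each indent level is written as five literal `&nbsp;` entities.

```python
import re

# Tempered pattern: after an ID, consume characters one at a time only
# while the upcoming text is not another "ID=", then require 8 "&nbsp;".
LAST_ID_BEFORE_DEEP_INDENT = re.compile(
    r'ID=(\d+)(?:(?!ID=).)*?(?:&nbsp;){8}', re.DOTALL)

# Hypothetical cell text (assumption): 5 "&nbsp;" per indent level.
NBSP = '&nbsp;'
cell = (
    'ID=11669 Antam<br>'
    + NBSP * 5 + 'ID=19806 Pomalaa<br>'
    + NBSP * 10 + 'ID=29412 Maniang<br>'
    + NBSP * 5 + 'ID=11665 Southeast<br>'
    + NBSP * 10 + 'ID=27877 Bahubulu'
)

parents = LAST_ID_BEFORE_DEEP_INDENT.findall(cell)
print(parents)  # ['19806', '11665']
```

Note that this only finds the ID immediately before each deeply indented run. If a property has several third-level children in a row, the later children are preceded by a sibling rather than by their parent, so a pattern like this would likely need to be combined with some per-line bookkeeping.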

user1487861
  • You should not use regex for HTML/XML parsing! – Ωmega Jun 28 '12 at 12:14
  • http://stackoverflow.com/a/1732454/1350899 – mata Jun 28 '12 at 12:27
  • Thanks guys. I understand your frustration with people using regex to parse HTML, but I'm not really doing that. I have used Beautiful Soup to get to the cell, and I'm then trying to use regex to parse the text in the cell, because it's not formatted as HTML there; it's formatted as indented text, effectively a bunch of nbsp's. I probably should have got rid of all the HTML tags to simplify the problem and avoid the confusion, as they are really just background noise... – user1487861 Jun 28 '12 at 12:42
  • Why don't you use lxml for parsing HTML? – 18bytes Jun 28 '12 at 12:52
  • I think the editing that was done after I posted this confused the issue because it showed all of the HTML tags. I have re-edited the question to remove all HTML tags which will hopefully get rid of that confusion. – user1487861 Jun 28 '12 at 22:44

2 Answers

1

After using BeautifulSoup to get your tags, parse the ID out of each link's query string instead of using a regex:

>>> from urlparse import urlparse, parse_qs
>>> myurl = 'ShowProp.asp?LL=PS&ID=19807'
>>> parse_qs(urlparse(myurl).query)
{'LL': ['PS'], 'ID': ['19807']}
>>> parse_qs(urlparse(myurl).query)['ID']
['19807']
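The session above uses the Python 2 `urlparse` module. On Python 3 the same two functions live in `urllib.parse`. A minimal sketch of the same lookup follows; `extract_id` is a hypothetical helper name, not something from the answer.

```python
from urllib.parse import urlparse, parse_qs

def extract_id(href):
    """Return the ID query parameter of a link as an int, or None if absent.

    Hypothetical helper for illustration only.
    """
    query = parse_qs(urlparse(href).query)
    values = query.get('ID')
    return int(values[0]) if values else None

print(extract_id('ShowProp.asp?LL=PS&ID=19807'))  # 19807
```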
Jon Clements
  • Thanks Jon. I can get the IDs without a problem. I think the confusion arose because my original formatting was edited to show the HTML tags. I have redone the question to clear that up. – user1487861 Jun 28 '12 at 22:37
0

I think the example code with the HTML in place made a lot more sense - actual data, instead of hand-waving.

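import BeautifulSoup  # BeautifulSoup 3 module

bs = BeautifulSoup.BeautifulSoup(html)

# parent_stack[i] is the most recent ID seen at depth i; None means "no parent"
parent_stack = [None]
res = []
for span in bs.findAll('span', {'style':'white-space:nowrap;display:inline-block'}):
    # five spaces per indent level; // keeps the slice index an integer
    indent = 1 + span.previousSibling.count(' ') // 5
    prop_id = int(span.find('input')['value'])
    name = span.find('a').text.strip()  # property name (not used below)

    # warning! this assumes that indent-level only ever
    #   increases by 1 level at a time!
    parent_stack = parent_stack[:indent] + [prop_id]
    res.append(parent_stack[-2:])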

results in

[[None, 11669],
 [11669, 19807],
 [11669, 19808],
 [11669, 18923],
 [11669, 29411],
 [11669, 19806],
 [19806, 29412],
 [11669, 11665],
 [11665, 27877],
 [11665, 50713],
 [11665, 27879],
 [11665, 27878],
 [11669, 11394]]
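The same parent-stack idea can also be applied to the plain indented text shown in the question, without the HTML. The sketch below assumes five whitespace characters per indent level; `parent_child_pairs` and `indent_width` are hypothetical names, and the sample is copied from the question.

```python
import re

# Leading whitespace (the indent) followed by "ID=<digits>".
LINE_RE = re.compile(r'^(\s*)ID=(\d+)')

def parent_child_pairs(text, indent_width=5):
    """Return (parent_id, child_id) tuples from indented 'ID=...' lines."""
    stack = []   # stack[d] is the most recent ID seen at depth d
    pairs = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and lines without an ID
        depth = len(match.group(1)) // indent_width
        prop_id = int(match.group(2))
        del stack[depth:]          # drop entries at this depth or deeper
        if stack:                  # top-level lines have no parent
            pairs.append((stack[-1], prop_id))
        stack.append(prop_id)
    return pairs

sample = "\n".join([
    "ID=11669 Antam Laterite Nickel/Ferronickel Operation",
    "     ID=19807 Gebe Laterite Nickel Mine",
    "     ID=19808 Gee Island Laterite Nickel Mine",
    "     ID=18923 Mornopo Laterite Nickel Mine",
    "     ID=29411 Pomalaa Ferronickel Smelter",
    "     ID=19806 Pomalaa Laterite Nickel Mine",
    "          ID=29412 Maniang Laterite Nickel Project",
    "     ID=11665 Southeast Sulawesi Laterite Nickel Project",
    "          ID=27877 Bahubulu Laterite Nickel Deposit",
])

for parent, child in parent_child_pairs(sample):
    print(parent, child)
```

Unlike the list above, this version leaves out the top-level entry instead of pairing it with None, which matches the MasterProp/SubProp table in the question.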
Hugh Bothwell