I have an HTML file containing all sorts of data, and I need to extract certain (id, title) pairs from it. To do this I wrote a regex that seems to work fine in an online regex tester.
The file I need to extract data from:

<g id="node841" class="cond_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cond</title>
<g id="node842" class="prompt_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt</title>
<g id="edge841" class="edge"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cond&#45;&gt;SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt</title>
<g id="node848" class="node"><title>SR_AUD_Main_link_51</title>
<g id="node841" class="prompt_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt</title>
<g id="node841" class="cmd_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cmd</title>
<g id="node856" class="exit_node"><title>EXIT_63</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_4</title><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>

With this regEx:

(<g\sid="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|EXIT))

This extracts the entire lines that meet the conditions above.
The Python script that applies the regex to the file contents to extract those lines:

result = re.search(r'(id="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|EXIT))', svg)

The problem is that result contains only one pair of data (for node id 848 only), separated by a space character, rather than the full list of lines that the regex should extract.

Do you have any idea how to extract all the data that matches that regex from the entire file, not only one line? In this particular case the extracted data should be, according to the online regex tester:

<g id="node848" class="node"><title>SR_AUD_Main_link_51</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_4</title>
Matteo Italia
Lucian
  • Why don't you look into BeautifulSoup? – bad_keypoints Jun 26 '15 at 09:34
  • Obligatory: http://stackoverflow.com/a/1732454/3001761 – jonrsharpe Jun 26 '15 at 09:36
  • @jonrsharpe [and related](https://meta.stackoverflow.com/questions/261561/please-stop-linking-to-the-zalgo-anti-cthulhu-regex-rant) (c: – Peter Wood Jun 26 '15 at 09:37
  • Just let us know whether you would consider a non-regex solution or are in some way forced to use regex. Note that regex is not the tool you need for this task unless "you know what you are doing". – Wiktor Stribiżew Jun 26 '15 at 09:39
  • [Regex101.com says your regex is not capturing what you expect](https://regex101.com/r/vY8yU1/1). Please explain what you really need to extract and the exact requirements. A `g` tag with class `node`, followed by a `title` whose value contains none of the `_cmd|_cond|_prompt|EXIT` "words"? – Wiktor Stribiżew Jun 26 '15 at 09:55
  • I want to extract all (id, title) pairs from nodes that have class="node" and whose title does NOT contain _cmd, _prompt, _cond or EXIT, so only the 3 mentioned above. – Lucian Jun 26 '15 at 10:28

2 Answers

from bs4 import BeautifulSoup

html="""<g id="node841" class="cond_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cond</title>
<g id="node842" class="prompt_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt</title>
<g id="edge841" class="edge"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cond&#45;&gt;SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt</title>
<g id="node848" class="node"><title>SR_AUD_Main_link_51</title>
<g id="node841" class="prompt_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt</title>
<g id="node841" class="cmd_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cmd</title>
<g id="node856" class="exit_node"><title>EXIT_63</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_4</title><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>"""


# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# Every <g> element in the document
a = soup.find_all("g")

# (id, text of the first <title>) for each <g>
b = [(i.get('id'), i.title.text) for i in a]

print(b)

output:

[('node841', 'SR_AUD_Nbest_List_PlaylistPlayPlaylist_cond'),
 ('node842', 'SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt'),
 ('edge841', 'SR_AUD_Nbest_List_PlaylistPlayPlaylist_cond->SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt'),
 ('node848', 'SR_AUD_Main_link_51'),
 ('node841', 'SR_AUD_Nbest_List_PlaylistPlayPlaylist_prompt'),
 ('node841', 'SR_AUD_Nbest_List_PlaylistPlayPlaylist_cmd'),
 ('node856', 'EXIT_63'),
 ('node860', 'SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3'),
 ('node860', 'SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_4')]

1. Find all the <g> tags:

soup.find_all("g")

2. You can access a tag's attributes by treating the tag like a dictionary:

`i.get('id')`

gives you the desired id, and i.title.text gives you the text of the first <title> inside the tag.
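The listing above returns every <g> tag, whereas the question asks only for tags with class="node" whose title contains none of _cmd, _cond, _prompt or EXIT. A minimal sketch of that filter follows. It assumes the bs4 package is installed and uses the built-in "html.parser" backend; the names svg, EXCLUDED and pairs are illustrative, and the sample is trimmed from the question's data.

```python
from bs4 import BeautifulSoup

# Trimmed sample from the question (illustrative, not the full file)
svg = """<g id="node841" class="cond_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cond</title>
<g id="node848" class="node"><title>SR_AUD_Main_link_51</title>
<g id="node856" class="exit_node"><title>EXIT_63</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_4</title><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>"""

# Substrings that disqualify a title, taken from the question's regex
EXCLUDED = ("_cmd", "_cond", "_prompt", "EXIT")

soup = BeautifulSoup(svg, "html.parser")

pairs = []
# class_="node" matches only tags whose class is exactly "node",
# so "cond_node", "exit_node" and so on are skipped
for g in soup.find_all("g", class_="node"):
    title = g.find("title")  # first <title> inside this <g>
    if title is None:
        continue
    text = title.get_text()
    if any(marker in text for marker in EXCLUDED):
        continue
    pairs.append((g.get("id"), text))

print(pairs)
```

Because the <g> tags in the sample are never closed, the parser nests them; taking the first <title> found inside each matching tag still yields the title that belongs to it.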

Peter Wood
Ajay

As mentioned in the comments, regular expressions may not be the best tool for parsing XML.

Having said that, the only problem with your approach seems to be that you are using re.search instead of re.findall or re.finditer, so you get only the first match instead of all of them.

import re

# svg holds the file contents, as in the question
p = r'(<g\sid="\w+"\s+class="node">+.{1,})(?!.+(_cmd|_cond|_prompt|EXIT))'
for match in re.finditer(p, svg):
    print(match.group())

However, note that for the last line of the sample it will capture the entire line, including the second <title>, not just the first one.
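If the goal is the (id, title) pairs rather than whole lines, one option is to capture the two values as groups and move the negative lookahead to the start of the title text. The sketch below is one possible variant, not the pattern from the question: the pattern, the trimmed sample and the names svg, pattern, pairs and first are assumptions made for illustration. It also shows the difference between search and findall.

```python
import re

# Trimmed sample from the question (illustrative, not the full file)
svg = """<g id="node841" class="cond_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cond</title>
<g id="node848" class="node"><title>SR_AUD_Main_link_51</title>
<g id="node841" class="cmd_node"><title>SR_AUD_Nbest_List_PlaylistPlayPlaylist_cmd</title>
<g id="node856" class="exit_node"><title>EXIT_63</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>
<g id="node860" class="node"><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_4</title><title>SR_AUD_ConfirmNAPlayPlaylistName_NotAvailable_3</title>"""

pattern = re.compile(
    r'<g\s+id="(\w+)"\s+class="node">\s*'                # group 1: the id
    r'<title>(?![^<]*(?:_cmd|_cond|_prompt|EXIT))'       # reject unwanted titles
    r'([^<]*)</title>'                                   # group 2: the title text
)

# search stops at the first match ...
first = pattern.search(svg)
print(first.groups())

# ... while findall returns one (id, title) tuple per match
pairs = pattern.findall(svg)
for node_id, title in pairs:
    print(node_id, title)
```

Here the lookahead is evaluated before the title is consumed, so it actually inspects the title text, and only the first <title> after each matching <g> tag is captured.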

tobias_k