Should I use Xpath or regexp for this?

Question

I'm no expert in programming languages and have no real knowledge of them. I'm pulling data from a website that is half dynamic.

For example, I need two columns, "Advising on a home purchase plan - Customer Type" and "Advising on a home purchase plan - Investment Type", which would list the types of customers and investments (there can be several of each). Multiple values can go into one cell, as long as they are separated by some sort of divider such as ";".

Here is what the table looks like:

[Image: how the table appears]

Here is what the code looks like:

Advising on a home purchase plan

                <div id="a2Nb000000035ohEAA" class="collapse DisciplineDetails PassportDetails PermDesc">
                  <h3 class="PermissionsListHeader">Advising on a home purchase plan</h3>
                  <br>
                  <br>
                </div>

                <ul class="PermissionConditionsList">
                  <li class="PermissionsConditionsItem">
                    Customer Type 

                    <ul class="PermCondsLimitationsList">
                      <li style="list-style: none"><span id="j_id0:j_id1:j_id110:regActTable:0:j_id531:0:j_id533:0:j_id535:0:j_id538"></span></li>

                      <li class="PermCondsLimitationsItem Popover">Customer</li>
                    </ul>
                  </li>
                </ul>

                <ul class="PermissionConditionsList">
                  <li class="PermissionsConditionsItem">
                    Investment Type 

                    <ul class="PermCondsLimitationsList">
                      <li style="list-style: none"><span id="j_id0:j_id1:j_id110:regActTable:0:j_id531:1:j_id533:0:j_id535:0:j_id538"></span></li>

                      <li class="PermCondsLimitationsItem Popover">Home purchase plans</li>
                    </ul>
                  </li>
                </ul>
              </div>

Before embarking on using RegExp, please say hello to [tony the pony](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) ... use XPath, or simply [querySelector](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector) and [querySelectorAll](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll) – Jaromanda X, Aug 11 '16 at 09:34
Thank you for your help, Jaromanda X. Is there any chance you could code such an XPath for this task? I'm struggling to find any simple information on this, as it gets far too technical for me. – Tomas, Aug 11 '16 at 09:59
Nice of you to offer, but the reason I say no is that I'm not terribly proficient at this sort of task – Jaromanda X, Aug 11 '16 at 10:22
We try to discourage job offers here, and outright requests to "code it for me". However, we do encourage self-attempts where possible. If you really need someone to do everything for you, then Stack Overflow is probably not an appropriate venue - maybe Reddit would be better? – halfer, Aug 11 '16 at 20:18
Thank you for your comment. Thanks to the Stack Overflow community I'm starting to get to grips with coding. My future questions will be about technical support rather than requests :) – Tomas, Aug 12 '16 at 11:38

LukStorms · Accepted Answer · 2016-08-12T09:29:53.507

2

This XPath works as long as there are no other lists with those classes that shouldn't be taken into account.

//ul[@class='PermCondsLimitationsList']/li[@class='PermCondsLimitationsItem Popover']/(text()|span/text())[normalize-space(.)]

Tested here
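The parenthesised step `/(text()|span/text())` is XPath 2.0 syntax, so XPath 1.0 engines will reject the expression as written. As a rough sketch (not the answerer's method), the same values can be collected with Python's standard `xml.etree.ElementTree`, which only understands a small XPath subset. The `SNIPPET` below is a trimmed copy of the question's markup made well-formed, with an assumed `<div>` wrapper; `limitation_values` is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the question's markup. The wrapping <div> is an
# assumption and the long generated span id is omitted for brevity.
SNIPPET = """<div>
  <ul class="PermCondsLimitationsList">
    <li style="list-style: none"><span></span></li>
    <li class="PermCondsLimitationsItem Popover">Home purchase plans</li>
  </ul>
</div>"""

# ElementTree handles attribute predicates but has no text() node test,
# so the path stops at the <li> and the text is read in Python instead.
ITEM_PATH = (
    ".//ul[@class='PermCondsLimitationsList']"
    "/li[@class='PermCondsLimitationsItem Popover']"
)


def limitation_values(xml_text):
    """Return the non-empty, whitespace-normalised text of each matching <li>."""
    root = ET.fromstring(xml_text)
    values = []
    for li in root.findall(ITEM_PATH):
        # itertext() walks the li's own text and the text of nested
        # elements such as <span>.
        text = " ".join("".join(li.itertext()).split())
        if text:  # plays the role of the [normalize-space(.)] filter
            values.append(text)
    return values


print(limitation_values(SNIPPET))
```

The real page contains unclosed `<br>` tags, so it would need an HTML-aware parser before a query like this could run against it.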

To just get the titles:

//ul[@class='PermissionConditionsList']/li[@class='PermissionsConditionsItem']/text()[normalize-space(.)]

Combined:

//ul[@class='PermissionConditionsList']/li[@class='PermissionsConditionsItem']/(text()|ul[@class='PermCondsLimitationsList']/li[@class='PermCondsLimitationsItem Popover']/(text()|span/text()))[normalize-space(.)]

But to get both in a certain format, an XSLT would probably be more useful.
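As an alternative to XSLT, here is a minimal Python sketch of the "titles plus values in one format" idea, using only the standard library. It is an illustration under assumptions, not part of the original answer: `SAMPLE` is the question's markup with the `<br>` tags and generated span ids dropped and an assumed outer `<div>` added, and `permission_columns` and `clean` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the question's markup (assumed outer <div> wrapper).
SAMPLE = """<div>
  <div id="a2Nb000000035ohEAA" class="collapse DisciplineDetails PassportDetails PermDesc">
    <h3 class="PermissionsListHeader">Advising on a home purchase plan</h3>
  </div>
  <ul class="PermissionConditionsList">
    <li class="PermissionsConditionsItem">
      Customer Type
      <ul class="PermCondsLimitationsList">
        <li style="list-style: none"><span></span></li>
        <li class="PermCondsLimitationsItem Popover">Customer</li>
      </ul>
    </li>
  </ul>
  <ul class="PermissionConditionsList">
    <li class="PermissionsConditionsItem">
      Investment Type
      <ul class="PermCondsLimitationsList">
        <li style="list-style: none"><span></span></li>
        <li class="PermCondsLimitationsItem Popover">Home purchase plans</li>
      </ul>
    </li>
  </ul>
</div>"""

# Relative to each condition <li>: the nested value items.
VALUE_PATH = (
    "ul[@class='PermCondsLimitationsList']"
    "/li[@class='PermCondsLimitationsItem Popover']"
)


def clean(text):
    """Collapse runs of whitespace; treat a missing text node as empty."""
    return " ".join((text or "").split())


def permission_columns(xml_text, sep=";"):
    """Map '<activity> - <condition title>' to its values joined by sep."""
    root = ET.fromstring(xml_text)
    header = root.find(".//h3[@class='PermissionsListHeader']")
    activity = clean(header.text) if header is not None else ""
    columns = {}
    for cond in root.findall(".//li[@class='PermissionsConditionsItem']"):
        title = clean(cond.text)  # text before the nested <ul>
        values = [clean("".join(li.itertext())) for li in cond.findall(VALUE_PATH)]
        key = f"{activity} - {title}" if activity else title
        columns[key] = sep.join(v for v in values if v)
    return columns


for name, cell in permission_columns(SAMPLE).items():
    print(f"{name}: {cell}")
```

For the sample this gives one entry per condition title, keyed as in the question ("Advising on a home purchase plan - Customer Type" and "... - Investment Type"). When a condition has several values they end up in one cell separated by ";".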

edited Aug 12 '16 at 09:29

answered Aug 11 '16 at 10:31

LukStorms


Hi Luke, thank you so much for your help! You are the first person to have actually given me something that's similar to what I'm after! I am willing to pay you for a meal or a coffee if you help me build this XPath correctly. Basically, this is the webpage I'm extracting from: go to Permissions, and those are the tables I need XPaths for, to extract 3 groups of information from each: Customer Type, Investment Type, and Limitations. I need it to match the name of the table, then extract those 3 groups of info from each. Is this something you could do? – Tomas Aug 11 '16 at 11:20
In general, people on this site are rewarded with the reputation points gained from upvotes/approvals etc. Coffee and meals have not (yet?) been implemented on Stack Exchange. And I've seen someone mention more than once that this isn't a code-writing service. And I have no interest in dabbling with that "you don't have to code" service named Import.io. I advise that you take a good look at the XML and experiment with XPath yourself, for example via the site I linked in my answer. That learning experience is worth the time. – LukStorms Aug 11 '16 at 11:34
I just offered as a courtesy; you don't have to take it. I realise this site is more for technical questions and not requests, but as I have no knowledge in this area, I'm clueless as to where to even start. I'm hoping I will find someone who would give up 5 minutes of their time to write the code and save me endless hours of researching and learning a new area which I'm totally green at. – Tomas Aug 11 '16 at 11:41
Btw, if you're specifically looking for help with import.io then perhaps you should add that tag to your question. People on this site often bookmark the tags they are most interested or most expert in. – LukStorms Aug 11 '16 at 11:44
Luke, any chance you could look into this for me? – Tomas Aug 11 '16 at 13:35
Added an xpath that combines both – LukStorms Aug 12 '16 at 09:34

score 0 · Answer 2 · answered Aug 11 '16 at 20:24


If you have Chrome, you can view the XPath of an element by right-clicking on the desired area and choosing Inspect. The relevant part of the source code will be highlighted. From there you can get the XPath by right-clicking the highlighted code and going to Copy -> Copy XPath.


wizardzz


Thank you, it works well with static sites, but the one I'm working on is half dynamic, meaning the XPaths that rely on div positions break when the page changes and end up pulling the wrong info. – Tomas Aug 12 '16 at 11:40
Ah, ok. Yes, then you are dependent on the class names. I'm not really familiar with import.io. I scrape and clean data in-house for my job, and I've used Jsoup and Html Agility Pack. I could help you with those syntaxes. – wizardzz Aug 12 '16 at 15:59
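
The difference discussed in these comments can be shown with a small sketch. It assumes a hypothetical page change (an extra list inserted before the one of interest) and compares a position-based path with a class-based one. The `BEFORE` and `AFTER` documents, both paths, and the `first_text` helper are invented for this illustration, and it uses Python's standard `xml.etree.ElementTree` rather than import.io.

```python
import xml.etree.ElementTree as ET

# Page as first scraped: the list of interest is the first <ul>.
BEFORE = """<div>
  <ul class="PermCondsLimitationsList">
    <li class="PermCondsLimitationsItem Popover">Customer</li>
  </ul>
</div>"""

# Hypothetical later version: an unrelated list now comes first.
AFTER = """<div>
  <ul class="Notices"><li>Hypothetical extra block</li></ul>
  <ul class="PermCondsLimitationsList">
    <li class="PermCondsLimitationsItem Popover">Customer</li>
  </ul>
</div>"""

# Depends purely on where elements sit in the tree.
POSITIONAL = "./ul[1]/li[1]"
# Depends on class attributes, wherever the list ends up.
CLASS_BASED = (
    ".//ul[@class='PermCondsLimitationsList']"
    "/li[@class='PermCondsLimitationsItem Popover']"
)


def first_text(xml_text, path):
    """Return the normalised text of the first node matching path, or None."""
    node = ET.fromstring(xml_text).find(path)
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())


for label, page in (("before", BEFORE), ("after", AFTER)):
    print(label, "positional: ", first_text(page, POSITIONAL))
    print(label, "class-based:", first_text(page, CLASS_BASED))
```

On the changed page the positional path returns the text of the inserted list, while the class-based path still returns "Customer". That is why the accepted answer's class-based expressions are the safer choice for a page whose layout shifts.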
