5

I'm trying to parse a webpage to get posts from a forum.
Each message starts with a div in the following format:

<div id="post_message_somenumber">

I only want to get the first one.

I tried xpath='//div[starts-with(@id, '"post_message_')]' in YQL without success.
I'm still learning this. Does anyone have suggestions?

Quintin Robinson
bigbucky
  • Good question, +1. See my answer for two possible causes of the problem and for a solution. – Dimitre Novatchev Feb 01 '11 at 05:32
  • The problem is with quotes and (perhaps secondarily) the value of the `id` (it doesn't start with a double quote). You want something like `xpath='//div[starts-with(@id, "post_message_")]'` – salathe Feb 01 '11 at 07:46
  • I don't know what yql is, but I suspect the issue is with how you write an XPath expression containing quotes and then embed it or escape it in your host language environment. – Michael Kay Feb 01 '11 at 09:45
  • Thanks for the responses. salathe, your suggestion worked. YQL is Yahoo Query Language and, along with Yahoo Pipes, it is a good way for people who don't know programming to learn how to parse web pages, combine RSS feeds, etc. – bigbucky Feb 01 '11 at 20:53
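
The quoting point from the comments can be checked outside YQL. The following is a minimal Java sketch using the standard javax.xml.xpath API, not YQL itself; the class name QuoteCheck and the sample XML are made up for illustration. It suggests that, read as plain XPath, the expression with the stray double quote is still valid but looks for ids that begin with a literal " character, so it matches nothing.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original question.
class QuoteCheck {
    // Made-up sample document with one matching id.
    static final String XML =
        "<root><div id=\"post_message_123\"/><div id=\"other\"/></root>";

    // Count the nodes that an XPath expression selects in the sample document.
    static int countMatches(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stray double quote inside the string literal: the prefix is "post_message_ with a leading quote.
        System.out.println(countMatches("//div[starts-with(@id, '\"post_message_')]")); // prints 0
        // Corrected quoting: the prefix is post_message_
        System.out.println(countMatches("//div[starts-with(@id, \"post_message_\")]")); // prints 1
    }
}
```

In Java the inner double quotes have to be escaped with a backslash; other host languages have their own escaping rules.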

3 Answers

6

I think I have a solution that does not require dealing with namespaces.

Here is one that selects all matching divs:

//div[@id[starts-with(.,"post_message")]]

But you said you wanted just the "first one" (I assume you mean the first "hit" in the whole page?). Here is a slight modification that selects just the first matching result:

(//div[@id[starts-with(.,"post_message")]])[1]

These use the dot to represent the id's value within the starts-with() function. You may have to escape special characters in your language.

It works great for me in PowerShell:

# Load a sample xml document
$xml = [xml]'<root><div id="post_message_somenumber"/><div id="not_post_message"/><div id="post_message_somenumber2"/></root>'

# Run the xpath selection of all matching div's
$xml.selectnodes('//div[@id[starts-with(.,"post_message")]]')

Result:

id
--
post_message_somenumber
post_message_somenumber2

Or, for just the first match:

# Run the xpath selection of the first matching div
$xml.selectnodes('(//div[@id[starts-with(.,"post_message")]])[1]')

Result:

id
--
post_message_somenumber
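
For readers not using PowerShell, here is a rough Java equivalent of the two selections above, written against the standard javax.xml.xpath API. It is only a sketch: the class name FirstMatchDemo and the helper matchedIds are invented for this example, and the XML is the same sample document as above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class mirroring the PowerShell sample.
class FirstMatchDemo {
    static final String XML =
        "<root><div id=\"post_message_somenumber\"/>"
        + "<div id=\"not_post_message\"/>"
        + "<div id=\"post_message_somenumber2\"/></root>";

    // Evaluate an XPath expression and return the id attribute of every selected element.
    static List<String> matchedIds(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // All matching divs.
        System.out.println(matchedIds(XML, "//div[@id[starts-with(.,\"post_message\")]]"));
        // prints [post_message_somenumber, post_message_somenumber2]

        // Only the first match in the whole document.
        System.out.println(matchedIds(XML, "(//div[@id[starts-with(.,\"post_message\")]])[1]"));
        // prints [post_message_somenumber]
    }
}
```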
Vimes
5

I tried xpath='//div[starts-with(@id, '"post_message_')]' in yql without success I'm still learning this, anyone have suggestions

If the problem isn't due to the many nested apostrophes and the unclosed double-quote, then the most likely cause (we can only guess without being shown the XML document) is that a default namespace is used.

Specifying names of elements that are in a default namespace is the most frequently asked question about XPath. If you search for "XPath default namespace" on SO or elsewhere on the internet, you'll find many sources with the correct solution.

Generally, a special method must be called that binds a prefix (say "x:") to the default namespace. Then, in the XPath expression every element name "someName" must be replaced by "x:someName".

Here is a good answer on how to do this in C#.

Read the documentation of your language or XPath engine to see how something similar should be done in your specific environment.
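
As one illustration of the prefix-binding step, here is a minimal Java sketch using javax.xml.xpath and a hand-written NamespaceContext. It assumes the page is XHTML with the default namespace http://www.w3.org/1999/xhtml, which the question does not confirm; the class name DefaultNamespaceDemo, the prefix "x", and the sample XML are all made up for this example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the XHTML namespace is an assumption about the page.
class DefaultNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String XML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div id=\"post_message_1\">first</div>"
        + "<div id=\"post_message_2\">second</div></body></html>";

    // Binds the prefix "x" to the XHTML namespace.
    static final NamespaceContext X_PREFIX = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "x" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                ? Collections.singletonList("x").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    // Count matches, optionally with the "x" prefix bound.
    static int countMatches(String expression, boolean bindPrefix) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-sensitive matching
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (bindPrefix) {
                xpath.setNamespaceContext(X_PREFIX);
            }
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed name: selects only elements in no namespace, so nothing matches.
        System.out.println(countMatches("//div[starts-with(@id, 'post_message_')]", false)); // prints 0
        // Prefixed name with "x" bound to the default namespace.
        System.out.println(countMatches("//x:div[starts-with(@id, 'post_message_')]", true)); // prints 2
    }
}
```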

Dimitre Novatchev
1
@FindBy(xpath = "//div[starts-with(@id,'expiredUserDetails') and contains(text(), 'Details')]") 
private WebElementFacade ListOfExpiredUsersDetails;

This matches every div on the page whose id starts with expiredUserDetails and whose text contains Details.
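
The same expression can be tried without a browser. The sketch below swaps in the standard javax.xml.xpath API for the annotation-based lookup above, so it only exercises the XPath itself; the class name CombinedPredicateDemo and the sample XML are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; evaluates the XPath outside of any page-object framework.
class CombinedPredicateDemo {
    // Made-up sample: only the first div satisfies both conditions.
    static final String XML =
        "<root>"
        + "<div id=\"expiredUserDetails1\">User Details</div>"
        + "<div id=\"expiredUserDetails2\">Summary</div>"
        + "<div id=\"activeUser1\">Details</div>"
        + "</root>";

    // Return the ids of all divs selected by the combined starts-with/contains predicate.
    static List<String> matchedIds() {
        String expression =
            "//div[starts-with(@id,'expiredUserDetails') and contains(text(), 'Details')]";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchedIds()); // prints [expiredUserDetails1]
    }
}
```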

bofredo
jaxy