XPath selector that can handle variable structures

Question

I have some text I need to extract using XPath selectors. The text can be in 3 different forms:

<td>
    TARGET_TEXT
</td>

<td>
    <p>
        TARGET_TEXT
    </p>
</td>

<td>
    <p>
        <strong>TARGET_TEXT</strong>
    </p>
</td>

Is there an XPath statement/selector I can use that will handle all 3 of these scenarios? Or is it possible to add OR statements in an XPath selector?

for tr in table_rows:
    # only handles case 1
    topic_name = tr.xpath('.//td[1]/text()').extract()[0]

Hey, Jake, how about [**accepting**](http://meta.stackoverflow.com/q/5234/234215) some of the fine answers you've gotten in the past. You've asked 18 questions since August and accepted 0. Something's wrong there. — kjhughes, Nov 18 '16 at 03:45

score 1 · Answer 1 · edited May 23 '17 at 10:30


This XPath,

normalize-space(/td)

will return the same space-normalized string value of /td,

TARGET_TEXT

for all three of your cases.

For more information on string values in XPath, see Testing text() nodes vs string values in XPath.
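To illustrate why the string value is identical in all three cases, here is a minimal sketch using only Python's standard library. It does not evaluate XPath itself; the helper `normalized_string_value` is a hypothetical name that approximates what `normalize-space()` computes (concatenate all descendant text, trim the ends, collapse whitespace runs). One known difference: Python's `str.split()` treats more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# The three markup variants from the question.
CASES = [
    "<td>\n    TARGET_TEXT\n</td>",
    "<td>\n    <p>\n        TARGET_TEXT\n    </p>\n</td>",
    "<td>\n    <p>\n        <strong>TARGET_TEXT</strong>\n    </p>\n</td>",
]


def normalized_string_value(element):
    """Approximate XPath normalize-space(element) for an ElementTree element."""
    # String value: all descendant text nodes concatenated in document order.
    string_value = "".join(element.itertext())
    # normalize-space: strip leading/trailing whitespace, collapse inner runs.
    return " ".join(string_value.split())


results = [normalized_string_value(ET.fromstring(case)) for case in CASES]
print(results)  # ['TARGET_TEXT', 'TARGET_TEXT', 'TARGET_TEXT']
```

In the Scrapy loop from the question, the relative form would presumably be `tr.xpath('normalize-space(.//td[1])').extract_first()`, assuming the target is the first cell of each row.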


answered Nov 18 '16 at 03:17

kjhughes


score 0 · Answer 2 · answered Nov 18 '16 at 05:23


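for tr in table_rows:
    # Collect every text node under each td, however deeply nested.
    # This covers all three cases, but the strings keep their whitespace.
    all_three = tr.xpath('.//td//text()').extract()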


宏杰李


score -1 · Answer 3 · answered Nov 18 '16 at 03:18


Looks like the following is adequate:

for tr in table_rows:
    topic_name = tr.xpath('.//td[1]//text()').extract()
    # topic_name can be ['\r\n', 'TARGET_TEXT', '\r\n']
    # join the pieces, then strip the surrounding whitespace
    topic_name = ''.join(topic_name).strip()


sazr

