
I am trying to select a div whose class attribute contains multiple spaces and a newline. I want to select all divs that have both the test-one and topit classes. Here is a fragment of what the markup looks like:

<div class="test-one
                    topit
        ">


        <div class='test-one a'>1
        </div>
        <div class='topit'>2
        </div>
</div>

<div class="test-one
                    topit
        ">


        <div class='test-one a'>1
        </div>
        <div class='topit'>2
        </div>
</div>

Here is what I have tried:

"//div[contains(concat(' ', normalize-space(@class), ' '), ' topranks ') and contains(concat(' ', normalize-space(@class), ' ), ' list-node ')]"

and

//*[contains(concat(' ', normalize-space(@class), ' '), ' atag ')]

Sources I have tried to improve on:

XPath - How to select by @text that contains new line

and

How can I match on an attribute that contains a certain string?

Jide Koso
  • The XPath expression you provided works: `//div[contains(concat(' ', normalize-space(@class), ' '), ' topit ') and contains(concat(' ', normalize-space(@class), ' '), ' test-one ')]` (of course, the one you gave had `topranks` instead of `topit` and `list-node` instead of `test-one`, but I'm guessing you changed them when testing; if you didn't, then there you go). – acdcjunior Aug 05 '15 at 17:18
  • @unutbu and @acdcjunior thanks, it seems like it should work, but it doesn't on the actual site. The original **CSS class** names are `list-node` and `topranks`. Here is the link: http://www.made-in-china.com/companysearch.do?xcase=hunt&order=0&style=b&page=1&word=bag&size=30&sizeHasChanged=0&memberLevel=blank&sgsMembershipFlag=&comProvince=nolimit&comCity=&cateCode=&comBusinessType=blank&numEmployees=&annualRevenue=&code=0&managementCertification= – Jide Koso Aug 05 '15 at 17:23
  • Jide, what is your question? One could infer a question like "What I tried didn't work. How can I make it work?" If that's your question, tell us what actually happened when you tried the XPath expression you showed. Was there an error? Did it select nothing? Too many things? The wrong thing? How do you know what the result was? – LarsH Aug 05 '15 at 17:34
  • @JideKoso I insist. The XPath you gave works. I just tried in the website you provided and it works. Well, I had to add a missing quote, but after that, it worked. Try it: `//div[contains(concat(' ', normalize-space(@class), ' '), ' topranks ') and contains(concat(' ', normalize-space(@class), ' '), ' list-node ')]` -- notice I closed the quote in `' ), ' list-node ')]` as you didn't have it. – acdcjunior Aug 05 '15 at 19:24
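To see why the expression in the comments above should match the multi-line class value, here is a rough pure-Python sketch of what the predicate computes. It is not an XPath engine: `normalize_space` and `has_class_token` are hypothetical helper names that only mimic `normalize-space()` and the `contains(concat(...))` idiom, and the two class strings are copied from the fragment in the question.

```python
import re

# Whitespace characters that XPath 1.0 normalize-space() treats as whitespace
XPATH_WS = " \t\r\n"


def normalize_space(value: str) -> str:
    """Mimic normalize-space(): trim both ends, collapse inner runs to one space."""
    return re.sub(r"[ \t\r\n]+", " ", value.strip(XPATH_WS))


def has_class_token(class_attr: str, token: str) -> bool:
    """Mimic contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    return f" {token} " in f" {normalize_space(class_attr)} "


# Class values taken from the question's fragment (outer div and first inner div)
outer = "test-one\n                    topit\n        "
inner = "test-one a"

print(normalize_space(outer))  # test-one topit
print(has_class_token(outer, "test-one") and has_class_token(outer, "topit"))  # True
print(has_class_token(inner, "test-one") and has_class_token(inner, "topit"))  # False
```

Padding both sides with a space is what keeps a token such as `topit` from matching inside a longer class name like `topit-wide`.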

1 Answer


cssselect

import cssselect
import lxml.html

# Translate the CSS selector into the equivalent XPath expression
cssselect.GenericTranslator().css_to_xpath('div.test-one.topit')
# "descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' test-one ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' topit '))]"
# Fetch and parse the live page (requires network access)
tree = lxml.html.parse('http://www.made-in-china.com/companysearch.do?xcase=hunt&order=0&style=b&page=1&word=bag&size=30&sizeHasChanged=0&memberLevel=blank&sgsMembershipFlag=&comProvince=nolimit&comCity=&cateCode=&comBusinessType=blank&numEmployees=&annualRevenue=&code=0&managementCertification=').getroot()

# Select every div that carries both classes
tree.cssselect('div.list-node.topranks')
# [<Element div at 0x7f62e732dd18>, <Element div at 0x7f62e72d1f48>, <Element div at 0x7f62e72eb188>, <Element div at 0x7f62e72eb0e8>, <Element div at 0x7f62e72eb138>, <Element div at 0x7f62e72eb1d8>, <Element div at 0x7f62e72eb228>, <Element div at 0x7f62e72eb278>, <Element div at 0x7f62e72eb2c8>, <Element div at 0x7f62e72eb318>]
Oleh Prypin
  • This is a good way to solve the problem (if there is a place for Python in the toolchain), but it really just shows that the OP's XPath expression is correct, and doesn't explain why it isn't working. To put it another way, if the OP's XPath expression isn't working for him, why would this solution do any better? – LarsH Aug 05 '15 at 17:42
  • @LarsH The expression used here is different. Maybe it will actually work. Also, OP linked to this question in a chat about Python and stated that they use lxml. – Oleh Prypin Aug 05 '15 at 18:07
  • The only difference I see is the test for `@class and`, which is already known to be true for any div that would satisfy the rest of the predicate, and so won't make a difference in his case. `descendant-or-self::` is the expansion of `//`. Is there another difference? Besides added parentheses that don't change the meaning? – LarsH Aug 05 '15 at 20:45
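The point in the last comment, that the extra `@class and` test cannot change the result, can be checked with a small pure-Python sketch. This only models the predicate logic and is not an XPath engine. The function names are hypothetical, and a missing attribute is represented as `None` on the assumption that XPath converts an empty node-set to the empty string before `normalize-space()` runs.

```python
import re


def token_in_class(class_value: str, token: str) -> bool:
    """Mimic contains(concat(' ', normalize-space(value), ' '), ' token ')."""
    normalized = re.sub(r"[ \t\r\n]+", " ", class_value.strip(" \t\r\n"))
    return f" {token} " in f" {normalized} "


def without_guard(class_attr, token):
    # A missing attribute (None here) becomes the empty string, as in XPath
    return token_in_class("" if class_attr is None else class_attr, token)


def with_guard(class_attr, token):
    # "@class and ..." additionally requires the attribute to exist
    return class_attr is not None and without_guard(class_attr, token)


# None stands for a div with no class attribute at all
samples = [None, "", "topit", "test-one\n   topit\n  ", "test-one a"]
results = [(with_guard(c, "topit"), without_guard(c, "topit")) for c in samples]

print(all(a == b for a, b in results))  # True
```

Both versions agree on every sample because an absent or empty class can never contain the space-padded token, so the guard only repeats a condition the rest of the predicate already enforces.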