I want to extract only the address text from this fragment:

                  <br>

                5 Brown Circle<br>

                Alabaster,

                AL &nbsp;&nbsp;

                35007

I need a solid understanding of how to extract that text from the following HTML document:

<tr class="prem-tr" id="10425" role="row">
                    <td>
                        <h4><a class="prem-result-link" href="/Search/Details/10425">Graham &amp; Associates, CPAs</a></h4>

                        <a href="tel:+(205) 663-6673">(205) 663-6673</a>
                        <br>

                        5 Brown Circle<br>

                        Alabaster,

                        AL &nbsp;&nbsp;

                        35007

                        <div class="row result-btmRow">
                            <div class="col-sm-4">
                                <span class="result-dist"><small>Distance: 0.00 miles</small></span>
                            </div><!-- col6 -->
                            <div class="col-sm-8 result-actions">
                                <a id="WebsiteURL" class="visit-site" href="http://grahamandassociates.net" target="_blank">Visit Website</a>&nbsp;&nbsp;

                                <a class="send-email" href="/Search/Details/10425">Send a Message</a>
                            </div><!-- /col6 -->
                        </div><!-- /row -->
                    </td>
                </tr>

Expected output, using only XPath and with an explanation: 5 Brown Circle, Alabaster, AL 35007

OR

With CSS selectors it works fine. Can anyone explain the following code? Thanks.

" ".join([" ".join(el.root.strip().split()) for el in sel.css("td::text") if el.root.strip()])

2 Answers


Handling of the &nbsp; entity and the unclosed <br> tags may differ depending on exactly which XPath processor you are using, but the following expression produces the requested text. It needs XPath 2.0 or later, because it uses a function call as the final path step:

//td/text()[string-length(normalize-space(.)) > 0]/normalize-space(translate(.,'&#160;',''))

Where

  • //td selects all of the td nodes (just one in the example),
  • /text() selects all of the text nodes that are immediate children of the td,
  • predicate [string-length(normalize-space(.)) > 0] eliminates any text nodes that, when stripped of leading/trailing whitespace, are zero-length strings,
  • /normalize-space(translate(.,'&#160;','')) replaces the nbsp characters with nothing and eliminates leading/trailing whitespace from the remaining text nodes.
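As a rough illustration of what the predicate and the final step do, here is a pure-Python sketch that emulates them on the td's text nodes. The node list is hard-coded from the example HTML, and `xpath_normalize_space` and `xpath_translate_remove` are hypothetical helper names that only approximate the XPath functions; this is not how an XPath processor is implemented.

```python
import re

# Direct text children of the <td> in the example, hard-coded for illustration.
# "\xa0" is the character that &nbsp; / &#160; decodes to.
text_nodes = [
    "\n                        ",
    "\n\n                        ",
    "\n                        ",
    "\n\n                        5 Brown Circle",
    "\n\n                        Alabaster,\n\n                        AL \xa0\xa0\n\n                        35007\n\n                        ",
    "\n                    ",
]

XPATH_WS = " \t\r\n"  # XPath whitespace: space, tab, CR, LF (U+00A0 is NOT included)


def xpath_normalize_space(s):
    """Approximate normalize-space(): trim, then collapse whitespace runs."""
    return re.sub(r"[ \t\r\n]+", " ", s.strip(XPATH_WS))


def xpath_translate_remove(s, chars):
    """Approximate translate(s, chars, ''): delete every listed character."""
    return "".join(ch for ch in s if ch not in chars)


result = [
    xpath_normalize_space(xpath_translate_remove(node, "\xa0"))
    for node in text_nodes
    if len(xpath_normalize_space(node)) > 0  # the predicate
]
print(result)
```

The sketch yields one string per surviving text node, so the address comes back in two pieces rather than as a single string.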
David Denenberg
  • Great! But I have a small question: what does (.) mean here, and why do we use `&#160;` to remove &nbsp;? –  Aug 19 '21 at 12:33
  • The `&nbsp;` entities are actual characters and you did not show them in your intended output. `&#160;` is just the numeric character reference for the same character (see https://stackoverflow.com/questions/3274315/is-160-a-replacement-of-nbsp). In XPath, the dot refers to the context item. For example, in the expression `text()[string-length(normalize-space(.)) > 0]` the dot refers to the text() node the predicate is applied to. – David Denenberg Aug 19 '21 at 12:40
  • Not working how? What errors or unintended output are you receiving? – David Denenberg Aug 19 '21 at 16:12
  • I was experimenting with my answer more and noticed it is actually returning two text nodes instead of one. If your XPath processor supports a version of XPath having the string-join() function, it may be possible to return this as one string, but your exact whitespace requirements may be challenging to achieve. – David Denenberg Aug 19 '21 at 16:19

I would not say this is a great solution, but if the requirement is to use only XPath 1.0...

normalize-space(translate(concat(//td/text()[4], //td/text()[5]),"\xa0", ""))

Breaking it down a bit and demonstrating in IPython with lxml.etree:

All of the text nodes that are children of the td can be selected with //td/text(). This excludes the name and phone number because they are descendants but not children.

In [73]: root.xpath('//td/text()')
Out[73]: 
['\n                        ',
 '\n\n                        ',
 '\n                        ',
 '\n\n                        5 Brown Circle',
 '\n\n                        Alabaster,\n\n                        AL \xa0\xa0\n\n                        35007\n\n                        ',
 '\n                    ']

Ideally we could join all these strings and normalize the whitespace with normalize-space(), but this is awkward in XPath 1.0: the only joining function is concat(), which takes a fixed list of string arguments and, when given a node-set, uses only its first node, so it cannot join an arbitrary set of text nodes. Handling this in Python with join() would be better, but because there are only two text nodes that we're interested in, we can use concat() to concatenate the fourth and fifth text nodes in the set for a pure XPath solution.

In [74]: root.xpath('concat(//td/text()[4], //td/text()[5])')
Out[74]: '\n\n                        5 Brown Circle\n\n                        Alabaster,\n\n                        AL \xa0\xa0\n\n                        35007\n\n                        '

Now we can apply normalize-space() to clean up the whitespace.


In [75]: root.xpath('normalize-space(concat(//td/text()[4], //td/text()[5]))')
Out[75]: '5 Brown Circle Alabaster, AL \xa0\xa0 35007'

Almost there. Now we just have to get rid of the non-breaking space characters with translate() before we normalize the space.

In [79]: root.xpath('normalize-space(translate(concat(//td/text()[4], //td/text()[5]),"\xa0", ""))')
Out[79]: '5 Brown Circle Alabaster, AL 35007'

Note that because this is Python, we must use \xa0 instead of &nbsp; or &#160; to represent the non-breaking space character.
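For comparison, the Python-side join mentioned earlier might look like the sketch below. It assumes the text nodes have already been selected with //td/text(); the list is hard-coded from Out[73] so the snippet runs without lxml, and `join_text_nodes` is a hypothetical helper name. It relies on str.split() with no arguments treating the non-breaking space as whitespace, so no separate translate() step is needed. This is, in effect, what the one-liner in the question does.

```python
# Text nodes as returned by root.xpath('//td/text()') in Out[73], hard-coded here.
text_nodes = [
    "\n                        ",
    "\n\n                        ",
    "\n                        ",
    "\n\n                        5 Brown Circle",
    "\n\n                        Alabaster,\n\n                        AL \xa0\xa0\n\n                        35007\n\n                        ",
    "\n                    ",
]


def join_text_nodes(nodes):
    """Collapse whitespace inside each node, drop empty nodes, join the rest."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes "\xa0", so the non-breaking spaces vanish here.
    cleaned = (" ".join(node.split()) for node in nodes)
    return " ".join(part for part in cleaned if part)


address = join_text_nodes(text_nodes)
print(address)
```

Unlike the positional text()[4] and text()[5] approach, this does not depend on how many whitespace-only nodes precede the address.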

Forensic_07
  • Why do we use the translate method here? Could you explain a bit? –  Aug 19 '21 at 16:50
  • Certainly. `translate()` searches for instances of single characters in a string and replaces them with other single characters. If no replacement character is specified, the original character is just removed instead. So here, `translate()` is being used to erase the non-breaking space characters (`&nbsp;` in the original) so that the whitespace in the result matches your requirement. (`normalize-space()` handles most whitespace, but not non-breaking space characters.) – Forensic_07 Aug 19 '21 at 16:55
  • Well, you've concatenated text nodes. Could you elaborate on which part of the HTML document that text comes from? –  Aug 19 '21 at 17:10
  • Thanks. It's working fine, but the one thing I don't understand is the concatenation of nodes 4 and 5. Could you show me how you arrived at that? –  Aug 19 '21 at 17:43
  • The name and phone number are grandchildren, not direct children. Great explanation! This time I've understood everything. Thanks a lot. –  Aug 19 '21 at 17:58