Using xpath and R, how can you extract only a part of a text string where strings are not consistent?

Question

Is there a way using xpath and R (not PHP) to pick out only a piece (the city) from a longer address string?

Here is the relevant portion of the content of the following webpage:

http://www.kentmcbride.com/offices/

<table id="offices" cellspacing="8" width="700" height="100" border="0">
<tbody>
<tr>
<td valign="top">
<h2>
<img width="122" height="22" src="/_common/sub_philadelphia.png">
</h2>
<p>
1617 JFK Boulevard
<br>
Suite 1200
<br>
Philadelphia, PA 19103
</p>
</td>
<td valign="top">
<td valign="top">
</tr>

After parsing the content and applying an XPath expression, R returns the entire address string (remainder omitted), but I only want the city (and I do not know the city until I look at the returned content).

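library(XML)
# Parse the page into an internal document so XPath queries can run on it
doc <- htmlTreeParse('http://www.kentmcbride.com/offices/', useInternalNodes = TRUE)
# Return the text of every <p> inside the offices table
xpathSApply(doc, "//table[@id = 'offices']//p", xmlValue, trim = TRUE)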

[1] "1617 JFK Boulevard\n                Suite 1200\n                Philadelphia, PA 19103"                        
[2] "1040 Kings Highway North\n                Suite 600\n                Cherry Hill, NJ 08034"                    
[3] "824 North Market Street\n                Suite 805 \n                Wilmington, DE 19801"

A previous question, "XPath - How to extract specific part of the text from one text node", assumes I know the city name; I don't.

Is there a way to obtain only the city?

score 4 · Accepted Answer · answered Sep 08 '14 at 12:09

If we can assume the city is on the final line, you can select the last text node that follows a <br> node. In XPath this would be:

text()[preceding-sibling::br][last()]

That is, take the text nodes that have a br node as a preceding sibling, then keep only the last of these:

library(XML)
doc <- htmlTreeParse('http://www.kentmcbride.com/offices/', useInternalNodes = TRUE)
xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]")

> xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]")
[[1]]

                Philadelphia, PA 19103               

[[2]]

                Cherry Hill, NJ 08034 

[[3]]

                Wilmington, DE 19801 

[[4]]

                Blue Bell, PA 19422


[[5]]

                Iselin, NJ 08830 

[[6]]

                New York, NY 10170 

[[7]]

              Pittsburgh, PA 15222

answered Sep 08 '14 at 12:09

jdharrison


Thank you very much. I have to test whether this breaks when addresses have different numbers of lines. I am also doing this with other law firms and am hopeful the techniques are flexible enough to apply. – lawyeR Sep 08 '14 at 12:50
@lawyeR it just assumes the city is the last line. For reference the 7th address in the example has 4 lines versus others which have 3 – jdharrison Sep 08 '14 at 12:53
Just curious why you included `[preceding-sibling::br]` in your answer. It wouldn't make any difference vs. just `text()[last()]` unless the last text node child had no `<br>` element before it. I guess you're saying you don't want to interpret a one-line address as a city, but the 2nd line of a two-line address, you do? – LarsH Sep 08 '14 at 15:35
@LarsH in this case `text()[last()]` would suffice. The preceding br was included for flexibility depending on the OP's real world case. In the example given we could just take the last text node. – jdharrison Sep 08 '14 at 15:38
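The distinction discussed in these comments can be checked on a small inline fragment. This is only a sketch: the HTML string below is a hypothetical stand-in modelled on the page in the question (a three-line, a one-line and a four-line address), and `with_br` and `plain_last` are illustrative names.

```r
library(XML)

# Hypothetical fragment: three-line, one-line and four-line addresses
html <- paste0(
  '<table id="offices"><tr>',
  '<td><p>1617 JFK Boulevard<br>Suite 1200<br>Philadelphia, PA 19103</p></td>',
  '<td><p>Somewhere Unknown</p></td>',
  '<td><p>100 Example Street<br>Floor 2<br>Suite 5<br>Pittsburgh, PA 15222</p></td>',
  '</tr></table>'
)
doc <- htmlParse(html, asText = TRUE)

text_nodes <- "//table[@id = 'offices']//p/text()"

# Last text node that has a <br> somewhere before it in the same <p>
with_br <- xpathSApply(doc, paste0(text_nodes, "[preceding-sibling::br][last()]"), xmlValue)

# Last text node of each <p>, whether or not a <br> precedes it
plain_last <- xpathSApply(doc, paste0(text_nodes, "[last()]"), xmlValue)

print(with_br)
print(plain_last)
```

Under these assumptions both expressions pick the final line of the three-line and four-line addresses, so the number of lines does not matter. They differ only for the one-line address: `with_br` skips it, while `plain_last` returns it as if it were a city line.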

hrbrmstr · Answer 2 · 2014-09-08T12:40:02.403

@jdharrison did the XPath hard work (i.e. credit to him for the answer). This extra bit (which you can't do with just XPath) grabs the city:

require(stringr)

unlist(lapply(xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]", xmlValue), function(x) {
  str_match(x, "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
}))

## [1] "Philadelphia" "Cherry Hill"  "Wilmington"   "Blue Bell"    "Iselin"       "New York"     "Pittsburgh"
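For readers who prefer not to depend on stringr, the same capture can be sketched in base R with `regexec()` and `regmatches()`. This is an alternative to `str_match()`, not what the answer uses. The helper name `extract_city` and the sample strings are hypothetical; the first two mimic the padded last lines returned above.

```r
# Hypothetical last lines, padded the way the text nodes come back
last_lines <- c(
  "\n                Philadelphia, PA 19103               ",
  "\n                Cherry Hill, NJ 08034 ",
  "St. Louis, MO 63101",   # punctuation inside the city name
  "No comma here"          # nothing to capture
)

# Return the first capture group of `pattern`, or NA when there is no match
extract_city <- function(x, pattern = "^[[:space:]]*([[:alnum:][:blank:]]+),") {
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_, character(1))
}

strict <- extract_city(last_lines)

# A looser pattern (an assumption, not from the answer): anything up to the first comma
loose <- extract_city(last_lines, "^[[:space:]]*([^,]+),")

print(strict)
print(loose)
```

With the answer's pattern, a city containing a period or hyphen yields `NA` because only letters, digits and blanks are allowed before the comma. The looser pattern accepts such names, at the cost of also accepting anything else that precedes the first comma. Which one fits depends on how the other sites format their addresses.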

Suggested Edit:

xpathSApply(doc, "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]"
            , function(x){
              str_match(xmlValue(x), "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
            }
)

Actually, that's a rly good idea. In fact, I should have stuck to a new idiom I've been trying since dplyr came out and eliminated the anonymous function altogether:

# to be used in xpathSApply below
extractCity <- function(last_line) {
  str_match(xmlValue(last_line), "^[[:space:]]*([[:alnum:][:blank:]]+),")[,2]
}

xpathSApply(doc, 
            "//table[@id = 'offices']//p/text()[preceding-sibling::br][last()]", 
            extractCity)

@hrbrmstr: I feel naive, but I had never integrated other R functions into xpath like you did. A whole new world opens up!! – lawyeR, Sep 08 '14 at 12:51
You just have to watch the return value from the XPath (it can differ depending on what the XPath asked for). Rly handy for building data frames from "ugly" data. – hrbrmstr, Sep 08 '14 at 12:55
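As a rough sketch of the data-frame idea in the last comment, the city line can be split into city, state and ZIP columns in one pass. The sample strings are taken from the output shown earlier. The pattern, the column names and the assumption of a two-letter state followed by a five-digit ZIP are illustrative and would need adjusting for other sites.

```r
# Last lines as plain strings (in practice these would come from xmlValue)
last_lines <- c(
  "  Philadelphia, PA 19103  ",
  "  Cherry Hill, NJ 08034 ",
  "  New York, NY 10170 "
)

# Assumed layout: "<city>, <2-letter state> <5-digit ZIP>"
pattern <- "^[[:space:]]*([^,]+),[[:space:]]*([A-Z]{2})[[:space:]]+([0-9]{5})"
parts <- regmatches(last_lines, regexec(pattern, last_lines))

# Element 1 of each match is the full match; 2 to 4 are the capture groups.
# Indexing a non-matching (empty) result yields NA, so bad rows stay visible.
offices <- data.frame(
  city  = vapply(parts, `[`, character(1), 2),
  state = vapply(parts, `[`, character(1), 3),
  zip   = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

print(offices)
```

Keeping the ZIP as a character column preserves leading zeros such as the one in 08034.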
