I am trying to pick out only the URL that starts with /~/ and ends with .ashx, which appears inside the quotes, from the complete HTML source file that I have scraped. I tried the function below to get the list of href matches.

library(XML)  # provides htmlTreeParse() and xpathSApply()

processHTML <- function(html) {
  doc <- htmlTreeParse(html, useInternalNodes=TRUE)
  # Return the href attribute of every <a> element
  xpathSApply(doc, "//a/@href")
}

From the snippet below, I need only the URL, without the href name and the quotation marks, i.e. /~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx:

href   "/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"

Please help me with a regular expression for the above problem.

Wiktor Stribiżew
vivek goud

1 Answer

If I understood the question correctly, this might help:

txt[grepl('\\.ashx$', txt)][['href']]

Output is:

[1] "/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"

Sample data:

txt <- structure(c("mailto:?subject=From%20mckinsey.com%3a%20Discussions%20in%20digital%3a%20What%e2%80%99s%20a%20marketing%20ecosystem%20and%20what%20does%20it%20mean%20for%20marketers%3f&body=I%20recommend%20you%20visit%20mckinsey.com%20to%20read%3a%0d%0a%0d%0aDiscussions%20in%20digital%3a%20What%e2%80%99s%20a%20marketing%20ecosystem%20and%20what%20does%20it%20mean%20for%20marketers%3f%0d%0ahttp%3a%2f%2fwww.mckinsey.com%2fbusiness-functions%2fmarketing-and-sales%2four-insights%2fdiscussions-in-digital-whats-a-marketing-ecosystem%3fcid%3deml-web", 
"/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"
), .Names = c("href", "href"))
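
If the goal is to match the whole URL from /~/ through .ashx rather than just test for the extension, an anchored pattern is one option. The following is a minimal sketch in base R, not part of the original answer; the `hrefs` vector and its shortened `mailto:` entry are hypothetical stand-ins for the values returned by `xpathSApply(doc, "//a/@href")`.

```r
# Hypothetical sample input standing in for the scraped href values.
hrefs <- c(
  href = "mailto:?subject=From%20mckinsey.com",
  href = "/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx"
)

# Keep only values that start with "/~/" and end with ".ashx".
# "^" and "$" anchor the whole string; "\\." matches a literal dot.
# unname() drops the "href" names so only the bare URL strings remain.
ashx_urls <- unname(grep("^/~/.*\\.ashx$", hrefs, value = TRUE))

print(ashx_urls)
```

Escaping the dot matters here: an unescaped `.` matches any character, so a pattern such as `'.ashx'` would also accept a string like `"xashx"`.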
Prem
  • Thanks for the help, but I need to pick the complete URL from /~/ to the end, .ashx: "/~/media/McKinsey/Business Functions/Marketing and Sales/Our Insights/Discussions in digital Whats a marketing ecosystem/Discussions-in-digital-Marketings-ecosystem.ashx" – vivek goud Mar 27 '18 at 06:29
  • Isn't the output shown above exactly the same as what you expect? I would suggest updating your post with the exact input (using `dput`) and the desired output; otherwise I am afraid this question will be closed soon ([this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) link might help). – Prem Mar 27 '18 at 06:33
  • It worked. Thank you so much! – vivek goud Mar 27 '18 at 06:44