I am new to getting information from the web into R, but I found this nice code, "How to get google search results", which shows how to get the links from an ordinary Google search into R.

I need to get this method running for the Google News search. I know I have to change the URL by adding something like "&source=lnms&tbm=nws". The URL I construct leads me to the right news result page if I copy and paste it from R into my browser, so far so good.

I looked at the HTML of the news result page and found that the information sits inside h3[@class='r dO0Ag'], but there is another node nested inside it and I don't know how to write that part of the XPath expression. I would appreciate any help!

[Screenshot: HTML of the first news result for China]

library(XML)
library(RCurl)



getGoogleURL <- function(search.term, domain = '.de', quotes=TRUE) 
{
  search.term <- gsub(' ', '%20', search.term)
  if(quotes) search.term <- paste('%22', search.term, '%22', sep='') 
  #construct google news url
  getGoogleURL <- paste('http://www.google', domain, '/search?q=',
                        search.term, sep='',"&source=lnms&tbm=nws")
  return(getGoogleURL)
}

getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
  #?? Wrong part - gives error evaluating xpath expression ??
  nodes <- getNodeSet(html, "//h3[@class='r dO0Ag']//a[@class='l lLrAF'//")

  dirt_links=sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]])

  links <- gsub('/url\\?q=','',sapply(strsplit(dirt_links[as.vector(grep('url',dirt_links))],split='&'),'[',1))
  return(links)
}

search.term <- "China"
quotes <- TRUE
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)

links <- getGoogleLinks(search.url)
www.pieronigro.de

1 Answer

You have a number of options here.

Either RCurl or RSelenium will work.

The key point is to generate the correct URL:

> library(XML)
> library(RCurl)
> search.term <- "china"
> quotes=FALSE
> start=0
> getGoogleURL <- paste('http://www.google.com',
+                       '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
+                       search.term, "&start=",start,sep='')
> getGoogleURL
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=china&start=0"
> 

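At this point, you can fetch the URL, build the HTML parse tree, and extract the node data. The start parameter sets where the returned results begin, counting from zero. In the runtime output further down, start=1 drops the first result and shifts the rest up by one, so it acts as a result offset rather than a page number. To get the fourth page you would therefore skip the first three pages of results.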
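As a minimal sketch of how a page number could be turned into a start value, the helper below multiplies the page by the page size. The names pageToStart and buildNewsURL are hypothetical and not part of the original answer, and the page size of 10 is an assumption based on the ten rows in the sample output of this answer.

```r
# Hypothetical helpers (not part of the original answer).
# Assumption: each result page holds 10 entries, as in the sample output.
pageToStart <- function(page, per.page = 10) {
  stopifnot(page >= 0, per.page > 0)
  page * per.page
}

buildNewsURL <- function(search.term, page = 0, per.page = 10) {
  # reserved = TRUE also escapes spaces and characters such as '&' in the term
  query <- URLencode(search.term, reserved = TRUE)
  paste0("http://www.google.com",
         "/search?hl=en&gl=kr&tbm=nws&authuser=0&q=",
         query, "&start=", pageToStart(page, per.page))
}

first.page  <- buildNewsURL("china", page = 0)
fourth.page <- buildNewsURL("south china sea", page = 3)
print(first.page)
print(fourth.page)
```

Only base R is used here, so the URL logic can be checked without sending any request.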

Working Code Example:

library(XML)
library(RCurl)

# Build the Google News search URL; start is passed through as the
# "start" query parameter.
getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
  search.term <- gsub(' ', '%20', search.term)
  if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
  google.url <- paste('http://www.google.com',
                      '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
                      search.term, "&start=",start,sep='')
  # Return the encoded URL explicitly
  URLencode(google.url)
}


# Fetch one set of Google News results and return it as a data frame
getGoogleNews <- function(search.term="China",
                          start=0,
                          quotes=FALSE ){
  google.url <- getGoogleURL(search.term=search.term,
                             start=start, quotes=quotes)
  print(google.url)
  doc <- getURL(google.url,
                httpheader = c("User-Agent" = "R(3.0.3)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE,
                        error=function(...){}, asText = TRUE)
  # Headline links: the link text is the title, href holds the target URL
  nodes <- getNodeSet(html, "//*/h3/a[@href]")
  title <- sapply(nodes, xmlValue)
  url <- sapply(nodes, xmlGetAttr, "href")
  url <- gsub("\\/url\\?q=", "", url)
  # Source and age, e.g. "CNBC - 17 hours ago"
  nodes <- getNodeSet(html, "//div[@class='slp']")
  source <- sapply(nodes, xmlValue)
  # Summary snippet
  nodes <- getNodeSet(html, "//div[@class='st']")
  summary <- sapply(nodes, xmlValue)
  data.frame(title=title, source=source, url=url, summary=summary)
}

getGoogleNews("China")
getGoogleNews("China", 1)
getGoogleNews("China", 2)
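The XPath part can be tried offline before any request is sent. The sketch below parses a made-up HTML fragment; its structure is an assumption for illustration only, with the class names taken from the question and from the code above, and the real Google markup may differ and change over time. It shows a well-formed version of the nested h3/a expression: every [...] predicate is closed and the path does not end in '//'.

```r
library(XML)

# Made-up HTML fragment for illustration only; the class names are the ones
# quoted in the question and answer, the real page markup may differ.
snippet <- paste0(
  "<html><body><div class='g'>",
  "<h3 class='r dO0Ag'>",
  "<a class='l lLrAF' href='/url?q=https://example.com/story'>Example headline</a>",
  "</h3>",
  "<div class='slp'>Example Source - 1 hour ago</div>",
  "<div class='st'>Example summary text.</div>",
  "</div></body></html>")

html <- htmlParse(snippet, asText = TRUE)

# Select the <a> that is a direct child of the <h3>
nodes <- getNodeSet(html, "//h3[@class='r dO0Ag']/a[@class='l lLrAF']")

title <- sapply(nodes, xmlValue)            # link text
href  <- sapply(nodes, xmlGetAttr, "href")  # raw href attribute

# The looser expression from the working code matches the same node here
loose <- getNodeSet(html, "//*/h3/a[@href]")

print(title)
print(href)
```

Matching on exact class strings is brittle, which may be why the working code above uses the looser //*/h3/a[@href] form.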

Runtime:

> library(XML)

> library(RCurl)

> getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
+   search.term <- gsub(' ', '%20', search.term)
+   if(quotes) search.term <- paste( .... [TRUNCATED] 

> getGoogleNews <- function(search.term="China",
+                           start=0,
+                           quotes=FALSE ){
+   google.url <- ge .... [TRUNCATED] 

> getGoogleNews("China")
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=0"
                                                                         title
1     Taiwan says China is 'out of control' as it loses El Salvador to Beijing
2  China central bank official rebuts Trump's claim it is manipulating the ...
3                                         Airbnb Wants to Find a Home in China
4          China's biggest risk may be its property market — not the trade war
5      Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
6                                     China reaches 800 million internet users
7       China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
8                     7 Signs that China's Military is Becoming More Dangerous
9           Asia markets trade mostly higher as investors look ahead to US ...
10       Can China, the world's biggest pork producer, contain a fatal pig ...
                                               source
1                                 CNBC - 17 hours ago
2                                 CNBC - 10 hours ago
3                                WIRED - 13 hours ago
4                                 CNBC - 23 hours ago
5                     Business Insider - 11 hours ago
6                           TechCrunch - 10 hours ago
7                        Express.co.uk - 12 hours ago
8  The National Interest Online (blog) - 16 hours ago
9                                 CNBC - 17 hours ago
10                     Science Magazine - 5 hours ago
                                                                                                                                                                                                                   url
1                      https://www.cnbc.com/2018/08/21/taiwan-says-china-out-of-control-as-it-loses-el-salvador-to-beijing.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIFCgAMAA&usg=AOvVaw2cSTmS65-6IvKQV9xrl3y3
2                          https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIHSgAMAE&usg=AOvVaw2q7yr2oBWHib3bRAVmOna-
3                                                                              https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIJigAMAI&usg=AOvVaw2a2LSkYlosnwTFRCvjmUhm
4                          https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIKSgAMAM&usg=AOvVaw1bUY5Ii7AlWURDifpeozJU
5                       https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIILCgAMAQ&usg=AOvVaw0yGdVilstHZVBBXEuuAbmu
6                                                   https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIINSgAMAU&usg=AOvVaw0VYTngAb-OBUSYkxKs0ZKp
7  https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIOCgAMAY&usg=AOvVaw3W5adCnWdzz71zvpgE1x6D
8                                  https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIPigAMAc&usg=AOvVaw1k05lyvFRrx_FImDKIsZ61
9                                               https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIQSgAMAg&usg=AOvVaw0YqzZPNbH9bawkv8qX8Bdm
10 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIRCgAMAk&usg=AOvVaw1H0c03l4trLI3cbRRlnKJW
                                                                                                                                                                    summary
1                            Taiwan vowed on Tuesday to fight China's "increasingly out of control" behavior after Taipei lost another ally to Beijing when El Salvador ...
2                   A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
3                                  China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
4                       China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
5  The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
6                            A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
7                             China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
8                          Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
9                             Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
10                       As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...

> getGoogleNews("China", 1)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=1"
                                                                         title
1  China central bank official rebuts Trump's claim it is manipulating the ...
2                                         Airbnb Wants to Find a Home in China
3          China's biggest risk may be its property market — not the trade war
4      Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
5                                     China reaches 800 million internet users
6       China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
7                     7 Signs that China's Military is Becoming More Dangerous
8           Asia markets trade mostly higher as investors look ahead to US ...
9        Can China, the world's biggest pork producer, contain a fatal pig ...
10      How China, India and the US use healthcare aid to win influence in ...
                                               source
1                                 CNBC - 10 hours ago
2                                WIRED - 13 hours ago
3                                 CNBC - 23 hours ago
4                     Business Insider - 11 hours ago
5                           TechCrunch - 10 hours ago
6                        Express.co.uk - 12 hours ago
7  The National Interest Online (blog) - 16 hours ago
8                                 CNBC - 17 hours ago
9                      Science Magazine - 5 hours ago
10                             ABC News - 5 hours ago
                                                                                                                                                                                                                      url
1                          https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggUKAAwAA&usg=AOvVaw1Muu65XvSSWVKX06-5syLY
2                                                                              https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggdKAAwAQ&usg=AOvVaw0Py7bJDY3tIj4KxgwYot1A
3                          https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgggKAAwAg&usg=AOvVaw2EHMCQvFQV9ubu17ERCZFO
4                       https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggjKAAwAw&usg=AOvVaw1sMhG0tyUnj8j2W02gD3aW
5                                                   https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggsKAAwBA&usg=AOvVaw1ODs1JY8V_ETi24ugz-yNn
6  https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggvKAAwBQ&usg=AOvVaw0r0HQNfZhEwfbiEocUC74Z
7                                  https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg1KAAwBg&usg=AOvVaw2hpQQXrAm2HW158II7F1kG
8                                               https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg4KAAwBw&usg=AOvVaw2surM3fW-lLJDd9P-r7xJB
9  http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg7KAAwCA&usg=AOvVaw3Lzvks6B0Un4IEgoMh86re
10                               http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg-KAAwCQ&usg=AOvVaw1Ogg8I6mUvDSCc9F90Usg4
                                                                                                                                                                    summary
1                   A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
2                                  China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
3                       China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
4  The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
5                            A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
6                             China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
7                          Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
8                             Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
9                        As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
10                          China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...

> getGoogleNews("China", 2)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=2"
                                                                      title
1                                      Airbnb Wants to Find a Home in China
2       China's biggest risk may be its property market — not the trade war
3   Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
4                                  China reaches 800 million internet users
5    China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
6                  7 Signs that China's Military is Becoming More Dangerous
7        Asia markets trade mostly higher as investors look ahead to US ...
8     Can China, the world's biggest pork producer, contain a fatal pig ...
9    How China, India and the US use healthcare aid to win influence in ...
10 China Is Leading in Artificial Intelligence--and American Businesses ...
                                               source
1                                WIRED - 13 hours ago
2                                 CNBC - 23 hours ago
3                     Business Insider - 11 hours ago
4                           TechCrunch - 10 hours ago
5                        Express.co.uk - 12 hours ago
6  The National Interest Online (blog) - 16 hours ago
7                                 CNBC - 17 hours ago
8                      Science Magazine - 5 hours ago
9                              ABC News - 5 hours ago
10                             Inc.com - 16 hours ago
                                                                                                                                                                                                                      url
1                                                                              https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggUKAAwAA&usg=AOvVaw3M4FbZ71J-NVKHn3fHvYwZ
2                          https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggXKAAwAQ&usg=AOvVaw3vieYvDvTlRzYkWncLgQfu
3                       https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggaKAAwAg&usg=AOvVaw3JGNk2Lraivca0P1lS3CoY
4                                                   https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggjKAAwAw&usg=AOvVaw2j4-NkfK_fNl8McD6WJjPa
5  https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggmKAAwBA&usg=AOvVaw0v1Lybg2SxcJoxVkP7sOx_
6                                  https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggsKAAwBQ&usg=AOvVaw1B7Krdzgd3LQEJ4bwWSSFW
7                                               https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggvKAAwBg&usg=AOvVaw0v734CDRel2Vpke9XVjLqA
8  http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggyKAAwBw&usg=AOvVaw1j6E7a1jk9JiIahN5pdmi7
9                                http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg1KAAwCA&usg=AOvVaw2E0qGfLhOkKZWhh5-_Is54
10                                              https://www.inc.com/magazine/201809/amy-webb/china-artificial-intelligence.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg4KAAwCQ&usg=AOvVaw1thfiF9hJWhz88BU8znvnD
                                                                                                                                                                    summary
1                                  China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
2                       China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
3  The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
4                            A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
5                             China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
6                          Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
7                             Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
8                        As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
9                           China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...
10                       Living in China in the early 2000s changed my perspective. I saw firsthand that the outside world's view--China was good at copying but bad at ...
> 
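The url column in the output above still carries Google's own parameters after the target address (everything from "&sa=" onwards). A minimal base R sketch for trimming them follows. The helper name cleanGoogleLink is hypothetical, and it rests on the assumption that Google's parameters always begin at "&sa=", which holds for the sample rows shown here but is not guaranteed. The first test string is a row from the output with the "/url?q=" prefix put back to show that step too.

```r
# Hypothetical helper (not from the original answer): reduce a Google
# redirect link to the target address.
# Assumption: Google's own parameters always start at "&sa=".
cleanGoogleLink <- function(link) {
  link <- sub("^/url\\?q=", "", link)  # drop the redirect prefix, if present
  sub("&sa=.*$", "", link)             # drop everything from "&sa=" onwards
}

raw <- c(
  "/url?q=https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIJigAMAI&usg=AOvVaw2a2LSkYlosnwTFRCvjmUhm",
  "https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIINSgAMAU&usg=AOvVaw0VYTngAb-OBUSYkxKs0ZKp"
)

cleaned <- cleanGoogleLink(raw)
print(cleaned)
```

Both sub calls are vectorised, so the helper could be applied to the whole url column at once.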

Web Page Test of URL

[Screenshot: the URL opened in a web browser]

Note: the order of results in the web page can differ from user to user, for example for a logged-in user.
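Because the results depend on who runs the query and when, it may help to record the retrieval time next to each row. The sketch below is one possible way to do that in base R. The function name stampResults and the column name retrieved are hypothetical, and the demo data frame stands in for the output of getGoogleNews.

```r
# Hypothetical wrapper: add the retrieval time (in UTC) to a results data frame
stampResults <- function(results, when = Sys.time()) {
  results$retrieved <- format(when, "%Y-%m-%d %H:%M:%S", tz = "UTC", usetz = TRUE)
  results
}

# Stand-in for a data frame returned by getGoogleNews()
demo <- data.frame(title = c("Example headline A", "Example headline B"),
                   stringsAsFactors = FALSE)

# A fixed time is passed here so the result is reproducible
stamped <- stampResults(demo, when = as.POSIXct("2018-08-21 15:56:00", tz = "UTC"))
print(stamped)
```

In real use the when argument would be left at its default, so each call records the current time.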

Citation:

Jinseog Kim - Associate Professor in the Department of Applied Statistics at Dongguk University. He received his Ph.D. in Statistics in 2003 from the Department of Statistics at Seoul National University. His research interests are data mining topics, including machine learning, big data analytics, and networked data analysis.

Presentation Link: http://datamining.dongguk.ac.kr/lectures/2016-2/bigdata/google.pdf

Technophobe01
  • Thanks @Technophobe01, this looks great. Can you explain what the parameter "start" is for? Do I get the most recent news with this method? – www.pieronigro.de Aug 21 '18 at 15:56
  • @Hiatus Q1 - In reference to 'start', Google returns multiple pages of results. Start=0 is the first page, Start=1 is the next and so on. Q2 - Recent News - yes this returns the latest Google news at the point of invocation. You might want to timestamp when you run the query. That way you can correlate news/order with a date, though results are different for different users. – Technophobe01 Aug 21 '18 at 18:05
  • @Hiatus - Whoops, updated the answer to correctly return the right results based on the start page passed in. Code example updated. – Technophobe01 Aug 22 '18 at 00:53