How to extract title from string containing mix of special characters and words in R

Question

I have a long string containing a mix of words and characters.

<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>

I need to extract only the title:

Improving Discriminator-Generator Balance in Generative Adversarial Networks

I know R has the ability to extract words between 2 characters, such as:

sub(">.*<", "", my_string)

But this obviously won't work here, as there is a mix of many characters.

You should probably use an HTML parser here rather than a regex. — Tim Biegeleisen, Nov 13 '17 at 02:35
Agreed. Although this leaves me with multiple substrings within the longer string (see revised question). If there was a way to detect the longest sentence inside a string and extract it, that would work. — Cybernetic, Nov 13 '17 at 02:43

score 3 · Accepted Answer · answered Nov 13 '17 at 02:42

You should probably be using an HTML parser here. That being said, the following one-liner with gsub might work:

gsub(".*?<a href=[^>]*>\\s*(.*?)\\s*</a>.*", "\\1", input)

I say might because I make many assumptions, including that the title anchor tag is the first one, and that you don't have nested content. In practice, you can try using an HTML/XML parser for greater control.
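To make those assumptions easier to spot, here is a minimal sketch that wraps the same pattern in a helper. The name extract_title and the choice of perl = TRUE are my own assumptions, not part of the one-liner above. The helper returns NA when the pattern does not match, because gsub would otherwise hand back the input unchanged, which is easy to mistake for a title.

```r
# Sample input copied from the question.
input <- '<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>'

# Hypothetical helper: same pattern as the one-liner, but returns NA when
# no anchor tag is found instead of returning the input unchanged.
extract_title <- function(x) {
  pattern <- ".*?<a href=[^>]*>\\s*(.*?)\\s*</a>.*"
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

extract_title(input)             # the text of the first anchor tag
extract_title("no anchor here")  # NA, since nothing matched
```

This still only looks at the first anchor tag, so it carries the same caveats as the one-liner.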

Tim Biegeleisen

Perfect! That does it! Thank you! – Cybernetic Nov 13 '17 at 02:44

score 3 · Answer 2 · answered Nov 13 '17 at 02:52

Assuming that u is the URL from where you obtained this HTML, an HTML parsing solution might look like:

library(rvest)
titles <- read_html(u) %>%
  html_nodes("a[href^='/forum']") %>% 
  html_text() %>%
  trimws()

This assumes that the href for titles starts with /forum and uses trimws to remove leading and trailing spaces.
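If you want to try the selector without fetching a page, the same pipeline can be run on a literal HTML string. The fragment below is a made-up example modelled on the snippet in the question (the ids AAA and BBB and both titles are hypothetical); it is only meant to show that the /forum selector keeps the title links and skips the PDF links.

```r
library(rvest)

# Hypothetical fragment with two entries, modelled on the question's snippet.
html <- '<div>
<h4><a href="/forum?id=AAA">  First Paper Title  </a><a href="/pdf?id=AAA" class="pdf-link">PDF</a></h4>
<h4><a href="/forum?id=BBB">  Second Paper Title  </a><a href="/pdf?id=BBB" class="pdf-link">PDF</a></h4>
</div>'

titles <- read_html(html) %>%
  html_nodes("a[href^='/forum']") %>%  # keep only anchors whose href starts with /forum
  html_text() %>%
  trimws()                             # drop the padding around each title

titles
```

Here titles should be a character vector with one element per /forum link, so the approach scales to a page with many entries.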

Mark · Answer 3 · 2017-11-13T02:58:41.143

You should not rely on regex for parsing HTML/XML: it is fragile and breaks easily when the markup changes. Consider using rvest. You can take HTML from any source and pass it to read_html() to parse it. html_text() extracts only the text content, and trimws() trims the excess whitespace that often exists in HTML.

library(rvest)
# The HTML fragment from the question, held as a plain string
string <- '<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>'
# Parse the string, pull out all text content, then trim the padding
read_html(string) %>%
  html_text() %>%
  trimws()

edited Nov 13 '17 at 02:58

answered Nov 13 '17 at 02:54

Mark

See the answer given by @neilfws – Tim Biegeleisen Nov 13 '17 at 02:55
@neilfws bases his answer on reading in a full webpage and extracting the node. My answer demonstrates that you can apply the same process to just the string (if that string is being obtained from a source other than a URL). – Mark Nov 13 '17 at 02:57
