I have a long string containing a mix of words and characters.

<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>

I need to extract only the title:

Improving Discriminator-Generator Balance in Generative Adversarial Networks

I know R can match the text between two characters, for example:

sub(">.*<", "", my_string)

But this obviously won't work here, as there is a mix of many characters.
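
To see concretely what that call does to the snippet above, here is a quick check. It is only a sketch: `my_string` is assumed to hold the HTML from this question.

```r
# Assumption: my_string holds the HTML snippet shown in the question.
my_string <- '<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>'

# ">.*<" is greedy: it matches from the first ">" to the last "<",
# and sub() deletes that whole span rather than extracting anything.
leftover <- sub(">.*<", "", my_string)
print(leftover)
```

Only the outer fragments of the `<h4>` tag survive, so the title is deleted along with everything else between the first `>` and the last `<`.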

Ronak Shah
Cybernetic
  • You should probably use an HTML parser here rather than a regex. – Tim Biegeleisen Nov 13 '17 at 02:35
  • Agreed. Although this leaves me with multiple substrings within the longer string (see revised question). If there was a way to detect the longest sentence inside a string and extract it, that would work. – Cybernetic Nov 13 '17 at 02:43

3 Answers

You should probably be using an HTML parser here. That being said, the following one-liner with gsub might work:

gsub(".*?<a href=[^>]*>\\s*(.*?)\\s*</a>.*", "\\1", input)

I say might because I make many assumptions, including that the title anchor tag is the first one, and that you don't have nested content. In practice, you can try using an HTML/XML parser for greater control.
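
As a rough check of the one-liner against the snippet from the question, here is a minimal sketch. The variable names `input`, `pattern` and `title` are illustrative, and `perl = TRUE` is my own addition (not part of the answer above) so that the lazy quantifiers are handled by PCRE.

```r
# Assumption: input holds the HTML snippet from the question.
input <- '<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>'

# Capture the text of the first <a href=...> element, without the
# surrounding whitespace, and replace the whole string with that capture.
pattern <- ".*?<a href=[^>]*>\\s*(.*?)\\s*</a>.*"
title <- gsub(pattern, "\\1", input, perl = TRUE)
print(title)

# Caveat: when the pattern does not match, gsub() returns its input unchanged.
unmatched <- gsub(pattern, "\\1", "no anchors here", perl = TRUE)
print(unmatched)
```

The second call illustrates one of the fragilities mentioned above: a string without a matching anchor passes through silently, so the result may need a separate check before it is treated as a title.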

Tim Biegeleisen

Assuming that u is the URL from which you obtained this HTML, an HTML parsing solution might look like:

library(rvest)
titles <- read_html(u) %>%
  html_nodes("a[href^='/forum']") %>% 
  html_text() %>%
  trimws()

This assumes that the href for titles starts with /forum and uses trimws to remove leading and trailing spaces.
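
If the HTML is already in a string rather than behind a URL, the same selector can be tried on it directly. This is a sketch under that assumption: `html` is a hypothetical variable holding the snippet from the question, and the rvest package is assumed to be installed.

```r
library(rvest)

# Assumption: html holds the HTML snippet from the question.
html <- '<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>'

titles <- read_html(html) %>%
  html_nodes("a[href^='/forum']") %>%  # keep only anchors whose href starts with /forum
  html_text() %>%                      # take the text content of each anchor
  trimws()                             # drop the padding whitespace
print(titles)
```

Selecting on the href prefix skips the PDF link, so any text inside that second anchor would not leak into the result.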

neilfws

You should not rely on regex for parsing HTML/XML: it is fragile and prone to breaking. Consider using rvest instead. You can take HTML from any source and pass it to read_html() to parse it. html_text() extracts only the text content, and trimws() trims the excess whitespace that often exists in HTML.

library(rvest)
string = '<h4>        <a href="/forum?id=SyBPtQfAZ">          Improving Discriminator-Generator Balance in Generative Adversarial Networks        </a>          <a href="/pdf?id=SyBPtQfAZ" class="pdf-link" title="Download PDF" target="_blank"><img src="/static/images/pdf_icon_blue.svg"/></a>              </h4>'
read_html(string) %>% 
  html_text() %>% 
  trimws()

Mark
  • See the answer given by @neilfws – Tim Biegeleisen Nov 13 '17 at 02:55
  • @neilfws bases his answer on reading in a full webpage and extracting the node. My answer demonstrates that you can apply the same process to just the string (if that string is being obtained from a source other than a URL). – Mark Nov 13 '17 at 02:57