I am having trouble deleting everything after the first occurrence of a pattern in R. I imported the data with paste(readLines(url), collapse="\n").
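For context, here is a minimal sketch of what that import step produces. A text connection stands in for the real URL (an assumption, since the actual page is not given), and the lines are taken from the example string in this question. The point is that the result is a single string with embedded newline characters.

```r
# Stand-in for the remote page: a text connection instead of a real URL
# (assumption for illustration; the lines come from the example string).
html_lines <- c("\"id=\"fruit_info\">",
                "<tr class='thead'>",
                "<th colspan=2>Strawberries</th></table>",
                "</tr>",
                "</table>",
                "<tr class")

con <- textConnection(html_lines)
x <- paste(readLines(con), collapse = "\n")  # same call shape as with a URL
close(con)

length(x)                     # one string, not one element per line
grepl("\n", x, fixed = TRUE)  # the line breaks are now inside the string
```

Any pattern applied to x therefore has to cope with newlines inside the text it matches.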

For example, my string is:

\"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th></table>\n</tr>\n</table>\n<tr class

I want to remove everything from the first occurrence of </table> onward. What I want to see is:

\"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th>

The methods I have tried do not seem to register the first </table> occurrence and do not produce the intended result.

Thanks!

jim mako

1 Answer

Try using the inline (?s) modifier, which forces the dot (.) to match across newline sequences. It requires perl = TRUE:

sub('(?s)</table>.*', '', x, perl = TRUE)
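
As a quick check, the call can be run against the example string from the question. This is only a sketch: the variable names result and expected are introduced here for illustration, and x is typed in by hand rather than read from a URL.

```r
# Example string from the question, written as an R string literal.
x <- "\"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th></table>\n</tr>\n</table>\n<tr class"

# (?s) lets the dot match newlines, so .* runs from the first </table>
# to the end of the whole string, not just to the end of that line.
result <- sub("(?s)</table>.*", "", x, perl = TRUE)

# The output the question asks for.
expected <- "\"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th>"

identical(result, expected)
cat(result, "\n")
```

Because the regex engine reports the leftmost match, the pattern anchors on the first </table>, and the greedy .* then consumes everything after it.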
hwnd
  • Thanks, I was having trouble understanding the spanning across newlines! – jim mako May 09 '15 at 20:03
  • Hmm, just `sub("</table>.*", "", x)` worked for me. In which cases will it fail? – David Arenburg May 09 '15 at 20:10
  • @DavidArenburg, in the case of it having newline sequences =) See here https://regex101.com/r/cV7nD7/1 – hwnd May 09 '15 at 20:16
  • I get that, but OP read it as one string `x <- "\"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th></table>\n</tr>\n</table>\n<tr class"` – David Arenburg May 09 '15 at 20:19
  • OP read the string in and collapsed it by newline, so the string would have newline characters in it. I can't confirm that, but I do know you need the dotall modifier to match across multiple newline sequences. – hwnd May 09 '15 at 20:21
  • Yes, it does. And my code still works on the above-mentioned `x`. Though never mind, it seems like you've got what OP wants, so that doesn't matter anymore I guess. – David Arenburg May 09 '15 at 20:24
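
The disagreement in these comments appears to come down to which regex engine is in use. As far as I know, R's default engine (TRE) lets the dot match a newline, while the PCRE engine selected by perl = TRUE does not unless (?s) is set. A small sketch comparing the two on the question's string (the variable names are introduced here for illustration):

```r
# Example string from the question, written as an R string literal.
x <- "\"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th></table>\n</tr>\n</table>\n<tr class"

# Default engine (TRE): the dot also matches "\n", so no modifier is needed.
default_engine <- sub("</table>.*", "", x)

# PCRE without (?s): the dot stops at the first "\n", so only the first
# "</table>" itself is removed and the later lines survive.
pcre_no_flag <- sub("</table>.*", "", x, perl = TRUE)

cat(default_engine, "\n")
cat(pcre_no_flag, "\n")
```

So both commenters can be right: the plain sub() call works with the default engine, and the (?s) modifier is what makes the same pattern work once perl = TRUE is used.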