regex, multiline extract in R

Question

I am having some problems with deleting everything after the first occurrence of a pattern in R. I have imported the data with paste(readLines(url), collapse="\n").

For example, my string is, \"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th></table>\n</tr>\n</table>\n<tr class.

I want to remove everything after the first occurrence of </table>. What I want to see is:

\"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th>

The methods I am trying do not seem to register the first </table> occurrence and do not provide the intended result.

Thanks!

Try `sub(".*", "", x)` (if `x` is your string) – David Arenburg May 09 '15 at 19:52

hwnd · Accepted Answer · 2015-05-09T20:10:00.690

7

Try using the inline (?s) modifier which forces the dot . to span across newline sequences.

sub('(?s)</table>.*', '', x, perl = TRUE)
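To make the effect of the modifier concrete, here is a minimal sketch using the example string from the question. The variable names `no_dotall` and `dotall` are made up for illustration; the only assumption is that `x` holds the collapsed string exactly as posted.

```r
# Example string from the question (collapsed with "\n", so it contains newlines)
x <- "\"id=\"fruit_info\">\n<tr class='thead'>\n<th colspan=2>Strawberries</th></table>\n</tr>\n</table>\n<tr class"

# PCRE without (?s): "." stops at the first newline, so only the rest of
# the line holding the first </table> is removed; later lines survive.
no_dotall <- sub("</table>.*", "", x, perl = TRUE)

# PCRE with (?s): "." also matches "\n", so everything from the first
# </table> to the end of the string is removed.
dotall <- sub("(?s)</table>.*", "", x, perl = TRUE)

cat(dotall, "\n")
```

With `perl = TRUE` and no `(?s)`, the second `</table>` and the trailing `<tr class` are still present in `no_dotall`, which is the failure mode the modifier avoids. As far as I know, R's default (non-PCRE) engine already lets `.` match a newline, which may be why a pattern without `(?s)` can appear to work when `perl = TRUE` is not set.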

edited May 09 '15 at 20:10

answered May 09 '15 at 19:55

hwnd


1

Thanks, I was having trouble understanding the spanning across newlines! – jim mako May 09 '15 at 20:03
Hmm, just `sub(".*", "", x)` worked for me. In which cases will it fail? – David Arenburg May 09 '15 at 20:10
@DavidArenburg, in the case of it having newline sequences =) See here https://regex101.com/r/cV7nD7/1 – hwnd May 09 '15 at 20:16
I get that, but OP read it as one string `x <-"\"id=\"fruit_info\">\n\nStrawberries\n\n\n – David Arenburg May 09 '15 at 20:19
OP read the string in and collapsed by newline, therefore the string would have newline characters in it. But I can't clarify, but I know you need the dotall modifier to match across multiple newline sequences. – hwnd May 09 '15 at 20:21
Yes, it does. And my code still works on the above mentioned `x`. Though never-mind, it seems like you've got what OP wants, so that doesn't matter anymore I guess. – David Arenburg May 09 '15 at 20:24
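For anyone following the comment thread, a small sketch of why the imported string contains newlines at all. It uses a `textConnection` with made-up lines as a stand-in for the OP's `url` (an assumption, since the real page is not shown), then collapses them the same way the question does.

```r
# Hypothetical stand-in for the remote page: three lines of markup
con <- textConnection(c("<table id='fruit_info'>", "<tr><th>Strawberries</th></tr>", "</table>"))
page_lines <- readLines(con)
close(con)

# Same collapse as in the question: one string, lines joined by "\n"
x <- paste(page_lines, collapse = "\n")

length(x)                       # a single string
grepl("\n", x, fixed = TRUE)    # it contains embedded newline characters
```

So after `paste(..., collapse = "\n")` the result is one string with embedded newlines, which is the situation the `(?s)` modifier is meant to handle when `perl = TRUE` is used.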
