gsub and str_extract

Question

I have the following character string

test <- "Mr Flowerpower discusses the challenges for the flower economy\r\nSpeech given by the Head of the Bank of Flowerland, Mr Yellow Flowerpower, at the Flowerland meeting\r\non 27 July 2089.\r\n   *    *    *\r\nI.         Introduction\r\nIt is a great day to talk to all these flower investors.  "

which was read in from a PDF. My aim is to extract everything up to the stars * * *.

Use gsub: match the star pattern and everything after it, and replace the match with an empty string

gsub("\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*.*","",test)
[1] "Mr Flowerpower discusses the challenges for the flower economy\r\nSpeech given by the Head of the Bank of Flowerland, Mr Yellow Flowerpower, at the Flowerland meeting\r\non 27 July 2089.\r"

Use str_extract: I would like to extract everything (.*) before the pattern:

str_extract(test, ".*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*")
[1] "\n   *    *    *"

However, the second option does not work. I think it fails because . does not match "\n". What would be the right approach to extract everything before the * * * pattern? Thanks for your help!
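A quick way to check that suspicion, as a minimal sketch: the two-line string `x` below is a made-up sample (not the original text), and base R with `perl = TRUE` stands in for stringr so that no package is needed. stringr's ICU engine treats `.` the same way by default.

```r
# Hypothetical two-line sample, not the original PDF text
x <- "first line\nsecond line"

# With PCRE (perl = TRUE), `.` stops at the line feed by default
first <- regmatches(x, regexpr(".*", x, perl = TRUE))

# The inline (?s) flag lets `.` match line breaks as well
whole <- regmatches(x, regexpr("(?s).*", x, perl = TRUE))

print(first)
print(whole)
```

The first call should return only the first line, while the second should return the whole string.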

Just add `(?s)` at the pattern start to make `.` match line break chars. — Wiktor Stribiżew, Feb 05 '20 at 10:37
Try `str_extract(test, "[^\\*]*")` to extract all the characters that are not asterisks. — meenaparam, Feb 05 '20 at 10:42
Just updated [the answer](https://stackoverflow.com/a/45981809/3832970) with R `stringr` example. — Wiktor Stribiżew, Feb 05 '20 at 10:49
thanks @WiktorStribiżew! but str_extract(test, "(?s)(.*)\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*") does not work for me — Methamortix, Feb 05 '20 at 10:56
`trimws(str_match(test, "(?s)(.*)\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*")[,2])` should. Or `trimws(str_match(test, "(?s)(.*)\\s{4}\\*\\s{4}\\*\\s{4}\\*")[,2])`. Shortening further - `"(?s)(.*)(?:\\s{4}\\*){3}"` — Wiktor Stribiżew, Feb 05 '20 at 10:58
Or, `trimws(str_extract(test, "(?s)(.*)(?=(?:\\s{4}\\*){3})"))`. The main issue is that `.` did not match line breaks, and this is a known issue. — Wiktor Stribiżew, Feb 05 '20 at 11:01
My fault, sorry. It works perfectly. Could you point me to a source where I can read more about regular expression syntax? All the sources I find are quite superficial, and I have a really hard time understanding it deeply! — Methamortix, Feb 05 '20 at 16:57
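For reference, the approach from the comments can be sketched in base R as follows. This is an illustration under assumptions rather than a definitive solution: `x` is a shortened, made-up stand-in for `test`, and `sub()` with `perl = TRUE` replaces the stringr calls so the snippet runs without extra packages.

```r
# Shortened, hypothetical stand-in for the original `test` string
x <- "Title line\r\non 27 July 2089.\r\n   *    *    *\r\nI. Introduction"

# (?s)            lets `.` match line breaks
# ^(.*?)          lazily captures everything before the star pattern
# (?:\\s{4}\\*){3} matches "four whitespace characters, then *" three times
# .*$             consumes the rest, so only the captured group survives
before <- sub("(?s)^(.*?)(?:\\s{4}\\*){3}.*$", "\\1", x, perl = TRUE)

# The capture still ends in "\r" (the "\n" counts as the first of the four
# whitespace characters), so trim the edges
before <- trimws(before)

print(before)
```

The capture group is kept lazy so that it stops at the first star pattern. With a greedy `(.*)`, as in the comments, it would extend to the last occurrence if the pattern appeared more than once.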
