
I have the following character string

test <- "Mr Flowerpower discusses the challenges for the flower economy\r\nSpeech given by the Head of the Bank  
of Flowerland, Mr Yellow Flowerpower, at the Flowerland meeting\r\non 27 July 2089.\r\n   
 *    *    *\r\nI.         Introduction\r\nIt is a great day to talk to all these flower investors.  "

which is text extracted from a PDF. My aim is to extract everything before the row of stars * * *.

  1. Use gsub: match everything from the star pattern onwards and replace it with an empty string
gsub("\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*.*","",test)
[1] "Mr Flowerpower discusses the challenges for the flower economy\r\nSpeech given by the Head of the Bank  
of Flowerland, Mr Yellow Flowerpower, at the Flowerland meeting\r\non 27 July 2089.\r"

  2. Use str_extract: I would like to extract everything (.*) before the pattern:
str_extract(test, ".*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*")
[1] "\n   *    *    *"

However, the second option does not work: it returns only the last line break plus the stars. I think this is because . does not match "\n". What would be the right approach to extract everything before the * * * pattern? Thanks for your help!

Methamortix
  • 1
    Just add `(?s)` at the pattern start to make `.` match line break chars. – Wiktor Stribiżew Feb 05 '20 at 10:37
  • 1
    Try `str_extract(test, "[^\\*]*")` to extract all the characters that are not asterisks. – meenaparam Feb 05 '20 at 10:42
  • 1
    Just updated [the answer](https://stackoverflow.com/a/45981809/3832970) with R `stringr` example. – Wiktor Stribiżew Feb 05 '20 at 10:49
  • thanks @WiktorStribiżew! but `str_extract(test, "(?s)(.*)\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*")` does not work for me – Methamortix Feb 05 '20 at 10:56
  • 1
    `trimws(str_match(test, "(?s)(.*)\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*\\s\\s\\s\\s\\*")[,2])` should. Or `trimws(str_match(test, "(?s)(.*)\\s{4}\\*\\s{4}\\*\\s{4}\\*")[,2])`. Shortening further - `"(?s)(.*)(?:\\s{4}\\*){3}"` – Wiktor Stribiżew Feb 05 '20 at 10:58
  • 1
    Or, `trimws(str_extract(test, "(?s)(.*)(?=(?:\\s{4}\\*){3})"))`. The main issue is that `.` did not match line breaks, and this is a known issue. – Wiktor Stribiżew Feb 05 '20 at 11:01
  • My fault, sorry. It works perfectly. Could you point me to a source where I can read more about regular expression syntax? All the sources I find are quite superficial and I have a really hard time understanding it in depth! – Methamortix Feb 05 '20 at 16:57
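
Pulling the comments together, a minimal sketch of the fix might look like the following. It assumes the `stringr` package is installed and uses a shortened stand-in for the string in the question (the exact text is an assumption, only the separator layout matters). The names `header` and `m` are illustrative. The lazy `.*?` is a choice made here so that the match stops at the first separator; the comments above use the greedy `.*`, which would run to the last one.

```r
library(stringr)

# Shortened stand-in for the string in the question: a header, the
# "*    *    *" separator, then the body.
test <- "Speech given by Mr Flowerpower\r\non 27 July 2089.\r\n   *    *    *\r\nI.         Introduction"

# Without (?s), `.` in stringr (ICU) does not match line breaks, so `.*`
# cannot reach back across the lines before the separator.
str_extract(test, ".*(?:\\s{4}\\*){3}")

# (?s) lets `.` match line breaks; the lookahead keeps the separator itself
# out of the match. trimws() drops the stray "\r" left at the end.
header <- trimws(str_extract(test, "(?s).*?(?=(?:\\s{4}\\*){3})"))
header

# Same idea with regex(dotall = TRUE) and a capture group via str_match().
m <- str_match(test, regex("(.*?)(?:\\s{4}\\*){3}", dotall = TRUE))
trimws(m[, 2])
```

The `gsub` attempt in the question works without any flag because base R's default regex engine lets `.` match line breaks, whereas `stringr` needs `(?s)` or `dotall = TRUE`.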
