0

Original title: Keep newline character in string during gsub

There is a post where I try to convert JSON to markdown unordered lists. It is almost done, but there is one pattern I cannot handle: if a title contains a space, hyphen, space sequence, the hyphen is treated as a list-item hyphen and gets indented. If I try to avoid this by referring to a newline character in the pattern, nothing works as I expect.

Input JSON: https://gist.github.com/hermanp/381eaf9f2bf5f2b9cdf22f5295e73eb5
Preferred output (two space indentation) markdown:

- Info
  - Python
    - The Ultimate Python Beginner's Handbook
    - Python Like You Mean It
    - Automate the Boring Stuff with Python
    - Data science Python notebooks
  - Frontend
    - CodePen
    - JavaScript - Wikipedia
    - CSS-Tricks
    - Butterick’s Practical Typography
    - Front-end Developer Handbook 2019
    - Using Ethics In Web Design
    - Client-Side Web Development
  - Stack Overflow
  - HUP
  - Hope in Source

To generate the markdown, I use the following two scripts:
generate_md()

library(jsonlite)

generate_md <- function (jsonfile) {
  bmarks_json_lite <- fromJSON(txt = jsonfile)
  level1 <- bmarks_json_lite$children$children[[2]]
  markdown_result <- recursive_func(level = level1)
  return(markdown_result)
}

recursive_func()

recursive_func <- function (level) {
  md_result <- character()
  
  for (i in seq_len(nrow(level))) {
    if (level[i, "type"] == "text/x-moz-place"){
      md_title <- paste0("- ", level[i, "title"], "\n")
    } else if (level[i, "type"] == "text/x-moz-place-container") {
      md_title <- paste0("- ", level[i, "title"], "\n")
      md_recurs <- recursive_func(level = level[i, "children"][[1]])
      
      # >>>>> This is the problematic part. <<<<<
      md_recurs <- gsub("-(?= )", "  -", md_recurs, perl = T)
      md_title <- paste0(md_title, md_recurs)
    }
    
    md_result <- paste0(md_result, md_title)
  }
  
  return(md_result)
}

With these functions I get the output below (note the unnecessary spaces in the JavaScript - Wikipedia entry). I want `- JavaScript - Wikipedia` instead of `- JavaScript     - Wikipedia`. I hope this example covers the different scenarios with hyphens and indentation; it is only a fraction of my bookmarks, because I wanted to give a minimal example.

cat(generate_md(paste0("https://gist.githubusercontent.com/hermanp/",
                       "381eaf9f2bf5f2b9cdf22f5295e73eb5/raw/",
                       "76b74b2c3b5e34c2410e99a3f1b6ef06977b2ec7/",
                       "bookmarks-example-hyphen.json")))
# Output
- Info
  - Python
    - The Ultimate Python Beginner's Handbook
    - Python Like You Mean It
    - Automate the Boring Stuff with Python
    - Data science Python notebooks
  - Frontend
    - CodePen
    - JavaScript     - Wikipedia
    - CSS-Tricks
    - Butterick’s Practical Typography
    - Front-end Developer Handbook 2019
    - Using Ethics In Web Design
    - Client-Side Web Development
  - Stack Overflow
  - HUP
  - Hope in Source
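The extra spaces can be reproduced without the JSON file. The following is a minimal sketch, assuming a single child entry with the title from the output above; the variable names `item`, `once` and `twice` are only for illustration.

```r
# Minimal reproduction of the substitution used in recursive_func().
# The sample title is taken from the bookmarks shown above.
item <- "- JavaScript - Wikipedia\n"

# One level of nesting: every "-" followed by a space gets two extra
# spaces in front of it, including the hyphen inside the title.
once <- gsub("-(?= )", "  -", item, perl = TRUE)

# A second level of nesting repeats the substitution on the same text,
# so the gap inside the title grows again.
twice <- gsub("-(?= )", "  -", once, perl = TRUE)

cat(once)
cat(twice)
```

Because the pattern only checks what follows the hyphen, it cannot tell a list marker at the start of a line from a hyphen in the middle of a title.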

I modified the gsub function part in recursive_func as seen below, without the desired output:

md_recurs <- gsub("-(?= )", "  -", md_recurs, perl = T)  # Original
md_recurs <- gsub("(\n)?-(?= )", "  -", md_recurs, perl = T)  # No newlines
md_recurs <- gsub("(-)(?= )(?<=\n)?", "  -", md_recurs, perl = T)  # Same as Original
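The "No newlines" result of the second attempt can be checked on a small string. This is a sketch under the assumption that `md_recurs` looks like the two-item block below; `block` and `no_newlines` are illustrative names.

```r
# Two sibling items, each terminated by a newline, as paste0() builds them
# in recursive_func().
block <- "- Python\n- Frontend\n"

# The optional group (\n)? is part of the match, so when a newline precedes
# the hyphen it is consumed and overwritten by the replacement "  -".
no_newlines <- gsub("(\n)?-(?= )", "  -", block, perl = TRUE)

print(no_newlines)
```

Only the newline directly in front of a hyphen is lost; the final newline survives because no hyphen follows it.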

I searched Google for "regex newline before char gsub site:stackoverflow.com" but found no answer or hint for this question. I also experimented on regex101.com, but could not find the right approach.

hermanp
  • What does `md_recurs` hold? Try `sub("^\\h*\\K-(?=\\h)", " -", md_recurs, perl=TRUE)` – Wiktor Stribiżew Dec 04 '20 at 14:33
  • I should admit that I am new to recursion and cannot answer your question well, but that command is needed to produce the right indentation for markdown. Otherwise the result is not a nested unordered list, just a flat unordered list. I am afraid I am not qualified enough to say properly what `md_recurs` holds. – hermanp Dec 04 '20 at 15:07
  • Come on, you have the code, `md_recurs <- gsub("-(?= )", " -", md_recurs, perl = T)`, what is the text inside `md_recurs`? – Wiktor Stribiżew Dec 04 '20 at 16:03
  • Please see my answer to the SO post. The text inside `md_recurs` is generated recursively, so I do not know at which step I should inspect it and present it to you. Please bear with me: I do not have a CS degree and was really happy after I got the recursive function working. I am eager to learn a method for grabbing the value of this variable. – hermanp Dec 04 '20 at 16:24

2 Answers

2

You can use

gsub("\\w\\h+-\\h(*SKIP)(*F)|-(?=\\h)", "  -", x, perl=TRUE)

See the regex demo. Details:

  • \w - a word char
  • \h+ - one or more horizontal whitespace
  • - - a - char
  • \h - a horizontal whitespace
  • (*SKIP)(*F) - omit text matched so far, fail the match and start searching from the location where it failed
  • | - or
  • - - a - char
  • (?=\h) - is immediately followed with a horizontal whitespace.
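A small check of this pattern, as a sketch: the sample string `x` below is an assumption built from titles in the question, not the real contents of `md_recurs`.

```r
# Sample block: a title with " - " inside it, a title with a hyphen that is
# not followed by whitespace, and an item that is already nested one level.
x <- "- JavaScript - Wikipedia\n- CSS-Tricks\n  - Nested\n"

# "word char, whitespace, hyphen, whitespace" is matched and then discarded
# by (*SKIP)(*F); only the remaining hyphens followed by whitespace are
# replaced.
result <- gsub("\\w\\h+-\\h(*SKIP)(*F)|-(?=\\h)", "  -", x, perl = TRUE)

cat(result)
```

Since `\h` does not match a newline, a hyphen at the start of a line is never mistaken for one inside a title, even when the previous line ends in a word character.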
Wiktor Stribiżew
  • I really appreciate your quick and thorough answer! Thanks! But I had to modify my question. May I ask you to look at it? – hermanp Dec 04 '20 at 13:29
  • @hermanp Your post is too long now, and you do not explain what the new issue is. – Wiktor Stribiżew Dec 04 '20 at 13:43
  • I will rewrite the entire question. I thought that the first minimal example had captured my whole problem, but after trying it, it seemed it did not. I edited my question to reflect the problem better, but you are right that it got too long. Therefore I will rewrite it, but unfortunately the input will not be as succinct as it was. – hermanp Dec 04 '20 at 13:55
  • @hermanp Try `sub("^\\h*\\K-(?=\\h)", " -", md_recurs, perl=TRUE)` – Wiktor Stribiżew Dec 04 '20 at 14:34
  • Sorry, but it's not good: the levels of the list are not nested in the correct way. I think your original answer may be the way, just needs some tinkering. Appreciate your work! I will try and comment here or update your answer if I find something! – hermanp Dec 04 '20 at 15:10
  • 1
    @hermanp I am not sure, but it seems you can try `gsub("\\w\\h+-\\h(*SKIP)(*F)|-(?=\\h)", " -", x, perl=TRUE)`. It will allow the same behavior as in your answer, but there can be any amount of horizontal whitespace that you used in the lookbehind. – Wiktor Stribiżew Dec 04 '20 at 18:39
  • Thank you for this suggestion. It produces the same output as my solution. – hermanp Dec 07 '20 at 10:07
  • @hermanp Yes, but it is more flexible because it allows one or more whitespace characters between the initial word char and the `-`. I updated the answer. – Wiktor Stribiżew Dec 07 '20 at 10:20
0

After I thought over the problem and the structure of the string and read about lookbehind I finally came up with the solution.

The md_recurs line needs to be changed to:

md_recurs <- gsub("(?<!(\\w ))-(?= )", "  -", md_recurs, perl = T)

Which means the gsub() pattern parameter had to be modified to:

(?<!(\\w ))-(?= )

Which means:

  • replace a hyphen - (with two spaces and a hyphen -)
  • if it is not preceded by a word character and a space (?<!(\\w )) and
  • if it is followed by a space (?= ).
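The lookbehind still depends on the character before the space being a word character. As a sketch of an alternative (not the method used in this answer), the whole block can be indented line by line instead of hyphen by hyphen. The helper `indent_block()` and the second sample title are hypothetical and only serve the illustration.

```r
# Hypothetical helper (not from the original post): prefix every line of a
# newline-terminated block with a fixed indent, leaving the rest untouched.
indent_block <- function(block, indent = "  ") {
  if (length(block) == 0L || !nzchar(block[1])) return("")
  lines <- strsplit(block, "\n", fixed = TRUE)[[1]]
  paste0(indent, lines, "\n", collapse = "")
}

# The second title is made up: its inner hyphen follows ")" and a space.
child <- "- JavaScript - Wikipedia\n- Notes (2019) - draft\n"

# Lookbehind version from this answer: the first title is handled, but the
# hyphen after ") " is not preceded by a word character and gets indented.
by_regex <- gsub("(?<!(\\w ))-(?= )", "  -", child, perl = TRUE)

# Line-based version: only the start of each line changes.
by_line <- indent_block(child)

cat(by_regex)
cat(by_line)
```

In recursive_func() this would mean replacing the gsub() line with `md_recurs <- indent_block(md_recurs)`, assuming titles themselves never contain newline characters.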
hermanp