
I have a text file that I want to convert to a data frame. The text is messy and needs cleaning: removing a couple of repetitive sentences, replacing new lines (the wildcard in Word is "^p") with a tab or comma, and so on.

For example, my text file looks like this:

-The data 1 is taken on Aug, 2009 at UBC
and is significant with p value <0.01

-The data 2 is taken on Sep, 2012 at SFU
and is  not significant with p value > 0.06

How can I do multiple find-and-replace operations? I used this code:

tx = readLines("My_text.txt")
tx2 = gsub(pattern = "is taken on", replacement = " ", x = tx)
tx3 = gsub(pattern = "at", replacement = " ", x = tx2)
writeLines(tx3, con="tx3.txt")

But I do not know how to replace "at" with a tab (^t), how to replace a paragraph mark (^p) with a comma, or, for example, how to replace a space followed by a paragraph mark ( ^p) with a comma.

jay.sf
Lionette

2 Answers


Use regular expressions that take word boundaries (\\b) into account.

To avoid multiple gsub() calls, we could use a replacement matrix rmx.

rmx <- matrix(c("\\sis taken on\\s\\b", " ",
                "\\b\\sat\\s", "\t"          #  replace with tab
                ), 2)
#      [,1]                   [,2]
# [1,] "\\sis taken on\\s\\b" "\\b\\sat\\s"
# [2,] " "                    "\t"

Now we can feed rmx to gsub() column by column using apply(). To make the changes to tx persist outside the anonymous function, we can use the <<- operator. To avoid spamming the console, we can wrap the whole thing in invisible().

tx <- readLines("My_text.txt")
invisible(
  apply(rmx, MARGIN=2, function(x) tx <<- gsub(x[1], x[2], tx))
  )
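
If you would rather not modify tx from inside the function with <<-, a plain loop over the columns of rmx gives the same result. This is only a sketch: the sample lines stand in for readLines("My_text.txt"), and apply_pairs is a hypothetical helper name, not something from base R.

```r
# Sample lines standing in for readLines("My_text.txt")
tx <- c("-The data 1 is taken on Aug, 2009 at UBC",
        "and is significant with p value <0.01",
        "",
        "-The data 2 is taken on Sep, 2012 at SFU",
        "and is  not significant with p value > 0.06")

# Same replacement matrix as above: one column per pattern/replacement pair
rmx <- matrix(c("\\sis taken on\\s\\b", " ",
                "\\b\\sat\\s", "\t"), 2)

# Hypothetical helper: apply each pair in turn and return the result,
# so no variable outside the function is modified
apply_pairs <- function(lines, pairs) {
  for (j in seq_len(ncol(pairs))) {
    lines <- gsub(pairs[1, j], pairs[2, j], lines)
  }
  lines
}

tx_clean <- apply_pairs(tx, rmx)
```

The order of the columns matters, since each replacement runs on the output of the previous one.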

To get continuous text rather than paragraphs (which is what I assume you mean by ^p replacement), we could simply paste() the result, collapsing with ", ". The empty strings should be filtered out with tx != "".

tx <- paste(tx[tx != ""], collapse=", ")
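
One caveat with tx != "" is that it keeps lines containing only spaces. Assuming you want to drop those too, a small variation is to trim before testing. The vector below is made-up sample data standing in for the cleaned tx.

```r
# Made-up cleaned lines, standing in for tx after the gsub() step
tx <- c("-The data 1 Aug, 2009\tUBC",
        "and is significant with p value <0.01",
        "   ",   # whitespace-only line, which tx != "" would keep
        "",
        "-The data 2 Sep, 2012\tSFU")

# TRUE for lines that still have content after trimming whitespace
keep <- nzchar(trimws(tx))

# Join the remaining lines into one string separated by ", "
one_line <- paste(tx[keep], collapse = ", ")
```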

Now writeLines().

writeLines(tx, con="tx4.txt")

Result

-The data 1 Aug, 2009 UBC, and is significant with p value <0.01, -The data 2 Sep, 2012 SFU, and is not significant with p value > 0.06

Appendix

We can also replace special characters in R by double-escaping them; read this post.

gsub("\\$", "\t", "today$is$monday")
# [1] "today\tis\tmonday"
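
Besides double-escaping, there are two other ways to match a literal $ that may read more easily. This is just a sketch comparing them on the same string; the variable names are made up for illustration.

```r
x <- "today$is$monday"

# Double-escaped: "\\$" is the regex \$, i.e. a literal dollar sign
escaped <- gsub("\\$", "\t", x)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
literal <- gsub("$", "\t", x, fixed = TRUE)

# Inside a character class, $ loses its special meaning
bracket <- gsub("[$]", "\t", x)

# All three give the same result
stopifnot(identical(escaped, literal), identical(escaped, bracket))
```

fixed = TRUE is handy when the search term comes from user input and may contain several special characters at once.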
jay.sf
  • Thanks. For a tab we can use \t, as you mentioned. But what about selecting the next line or paragraph? In Word we use ^t for tab and ^p for paragraph, but here \p did not work for me. – Lionette Oct 07 '19 at 19:23
  • I also noticed that if there are special characters like $ in the find terms, it won't work. For example, I cannot replace the $ in "today$is$monday" with "\t". – Lionette Oct 07 '19 at 19:25
  • @Lionette Ahaaa, I didn't realize the "word-code", since I'm not working with it. I assume now you want to connect the paragraphs by commas. See update. I'm sure you may refine this solution to your specific needs. – jay.sf Oct 07 '19 at 19:48

Using the regex supplied by jay.sf, you could use str_replace_all from the stringr package to do it with a named vector.

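library(stringr)

tx <- readLines("My_text.txt")

# Each name is a pattern, each value its replacement; applied in order
new_tx <- str_replace_all(tx,
                          c("\\sis taken on\\s" = " ",
                            "\\b\\sat\\s" = "\t",
                            "\\b\\sp\\b" = ","))  # matches a standalone letter "p"

# Print one element per line
cat(new_tx, sep = "\n")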

Result

-The data 1 Aug, 2009    UBC
and is significant with, value <0.01

-The data 2 Sep, 2012    SFU
and is  not significant with, value > 0.06
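
Note that the last pattern matches the standalone letter "p" in "p value", not a paragraph mark. If the aim is what Word's ^p does, the thing to match in R is the line break "\n", which only exists once the text is held as a single string. Here is a base R sketch on made-up sample text (the string is built by hand here in place of reading the file); the same pattern should also work as a name in the str_replace_all() vector, though I have only checked it with gsub().

```r
# One string with embedded line breaks, built here from sample lines
txt <- paste(c("-The data 1 Aug, 2009\tUBC",
               "and is significant with p value <0.01",
               "",
               "-The data 2 Sep, 2012\tSFU",
               "and is  not significant with p value > 0.06"),
             collapse = "\n")

# Replace each run of line breaks, plus any spaces just before it, with ", "
joined <- gsub("[ ]*\n+", ", ", txt)

cat(joined, "\n")
```

The optional [ ]* part covers the "space followed by ^p" case from the question, and \n+ collapses blank lines so they do not produce doubled commas.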
caldwellst