I have a CSV file with several columns: Tweet, date, etc. The spacing in some Tweets is causing blank lines and unwanted truncated lines.

What works:
1. Using Notepad++'s function "Line Operations > Remove Empty Lines (Containing Blank Characters)".
2. Search and replace: \r with nothing.

However, I need to do this for a large number of files, and I can't find a regular expression for gsub() in R that does what the Notepad++ function does.

Note that replacing ^[ \t]*$\r?\n with nothing and then \r with nothing, as suggested here, does work in Notepad++, but it does not work with gsub() in R.

I have tried the following code:

tx <- readLines("tweets.csv")
subbed <- gsub(pattern = "^[ \\t]*$\\r?\\n", replace = "", x = tx)
subbed <- gsub(pattern = "\r", replace = "", x = subbed)
writeLines(subbed, "output.csv")

This is the input:

Problems caused by spacing in Tweets

This is the desired output:

Desired output

1 Answer

You may use

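library(readtext)
# Read the whole file into one string so that the line breaks are kept
tx <- readtext("tweets.csv")
# At each line start, drop leading horizontal whitespace and, for blank lines, the line break too
subbed <- gsub("(?m)^\\h*\\R?", "", tx$text, perl=TRUE)
# Remove any remaining carriage returns
subbed <- gsub("\r", "", subbed, fixed=TRUE)
writeLines(trimws(subbed), "output.csv")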

The readtext library reads the whole file into a single variable, so all line break characters are kept and a regex can match them.
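The same idea can probably be done in base R too. Here is a minimal sketch, assuming the file fits in memory and that encoding needs no special handling: readChar() reads the whole file into one string, and the two substitutions from above are then applied to it. The helper names read_whole_file and clean_text are hypothetical, and the demo uses a small temporary file rather than your data.

```r
# Hypothetical helper: read a whole file into ONE string, line breaks included.
read_whole_file <- function(path) {
  readChar(path, nchars = file.size(path), useBytes = TRUE)
}

# Hypothetical helper: apply the same two substitutions as above to one string.
clean_text <- function(txt) {
  # \h = horizontal whitespace, \R = any line-break sequence (PCRE, so perl = TRUE).
  # At each line start, drop leading whitespace and, on blank lines, the break itself.
  txt <- gsub("(?m)^\\h*\\R?", "", txt, perl = TRUE)
  # Drop any carriage returns that are left.
  gsub("\r", "", txt, fixed = TRUE)
}

# Demo on a temporary file with CRLF line endings and one whitespace-only line.
tmp <- tempfile(fileext = ".csv")
writeChar("Tweet,date\r\nfirst part \r\n   \r\nsecond part,2020-03-06\r\n",
          tmp, eos = NULL)

cleaned <- clean_text(read_whole_file(tmp))
cat(cleaned)
```

For a large number of files, the two helpers could be called in a loop over the result of list.files(), writing each cleaned string back with cat(cleaned, file = ...). This only mirrors the two Notepad++ steps described in the question; whether it repairs every truncated line depends on which line breaks the real files contain.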

Wiktor Stribiżew
  • Thanks. The headers are there now, but there are still incomplete lines. I illustrate with some screen caps: Original [link] (https://drive.google.com/file/d/1xFSWDnl1A8EMWf4jK2PIAVtRSAF2aY5c/view?usp=sharing) Manual editing output: [link] (https://drive.google.com/file/d/1_dW8IvHdB9jvrPHnrS_kYJvGAOX9IYmM/view?usp=sharing) Output with this code: (https://drive.google.com/file/d/1EUTK7DSbgNXE72MVkbkwzWnogQDu2qYB/view?usp=sharing) E.g. lines 24, 35 are incomplete. – linguist_at_large Mar 06 '20 at 15:46
  • @linguist_at_large Ok, so you mean you need to read the file in a single variable, remove all blank lines and then remove CR symbols? – Wiktor Stribiżew Mar 06 '20 at 15:52
  • @linguist_at_large What are the linebreaks in the file? CRLF between lines and CR inside lines? What is the encoding? – Wiktor Stribiżew Mar 06 '20 at 15:55
  • I would just like to find the right regex so the output looks like the one I got by editing it manually with the Notepad++ method described above ("Remove Empty Lines (Containing Blank Characters)" and then delete the CRs). – linguist_at_large Mar 06 '20 at 15:56
  • @linguist_at_large It is not a matter of the regex: `readLines` reads the file in as separate *lines*, and you need to read the whole file into a single variable. – Wiktor Stribiżew Mar 06 '20 at 16:08
  • @linguist_at_large If it still does not work, you must share the file / part of the file you are working with. No images. – Wiktor Stribiżew Mar 06 '20 at 22:56
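
To illustrate the point made in the comments about readLines(), here is a small sketch using a made-up temporary file (not the asker's data). It shows that readLines() returns one element per line with the line terminators already removed, so a pattern that requires \n has nothing to match.

```r
# Demo file: CRLF line endings and one whitespace-only line.
tmp <- tempfile(fileext = ".csv")
writeChar("a,b\r\n  \r\nc,d\r\n", tmp, eos = NULL)

# readLines() splits on the line breaks and discards them.
tx <- readLines(tmp)
has_break <- grepl("[\r\n]", tx)   # FALSE for every element

# The pattern from the question needs "\n", so no element is changed.
unchanged <- gsub(pattern = "^[ \\t]*$\\r?\\n", replacement = "", x = tx)

# Working line by line, blank lines can only be dropped as whole elements.
non_blank <- tx[!grepl("^[ \t]*$", tx)]
print(non_blank)
```

Dropping blank elements this way handles the empty lines, but it cannot rejoin a Tweet that readLines() has already split at a stray line break, which is why reading the file as a single string is suggested.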