
I have a file with one million lines. After reading it with `readLines`, a condensed sample of it looks like this:

prob <- readLines("offendingFile.txt")
dput(prob)

c("000005928484|Name Nmee Leonel                        |YUMBO               |El Placer de El Cerrito   ALG 76248     |114|80041725|20140424|4132638|20140425|P|PED.ELE/100098-114       |Corregimiento de amaime", 
"", "          ||90300105       |V-1 MUIMERP NALBOC            |6.0000|30.820000|.0000|.00000000000000|6.0000|458114.67", 
"000005928484|Name Nmee Leonel                        |YUMBO               |El Placer de El Cerrito   ALG 76248     |114|80041725|20140424|4132638|20140425|P|PED.ELE/100098-114       |Corregimiento de amaime", 
"", "          ||90400105       |V-2 MUIMERP NALBOC            |3.0000|29.170000|.0000|.00000000000000|3.0000|169750.62", 
"000005928484|Name Nmee Leonel                        |YUMBO               |El Placer de El Cerrito   ALG 76248     |114|80041725|20140424|4132638|20140425|P|PED.ELE/100098-114       |Corregimiento de amaime", 
"", "          ||90700101       |V-OCIMONOCE LOREMIPSUM        |12.0000|5.980000|.0000|.00000000000000|12.0000|107118.18", 
"000815004980|Odrareg Oinotna Namzug S. En C.S.       |YUMBO               |Rozo (Palmira)            ALG 76520     |114|80041726|20140424|4132636|20140425|P|PED.ELE/100099-114       |Corregimiento de palmira"
)

I want to remove the LF LF sequences followed by spaces that occur in the file. In the sample above, that means removing elements 2, 5 and 8 and appending element 3 to 1, 6 to 4 and 9 to 7 (original numbering). So I tried:

prob2 <- gsub("\n {2,}", "", prob) #  didn't do anything
gsub("[\r\n] {2,}", "", prob)
gsub("\r?\n {2,}|\r {2,}", "", prob)

The last two lines are borrowed from this SO post.

How should I proceed?

Expected output:

dput(prob2)

c("000005928484|Name Nmee Leonel                        |YUMBO               |El Placer de El Cerrito   ALG 76248     |114|80041725|20140424|4132638|20140425|P|PED.ELE/100098-114       |Corregimiento de amaime        ||90300105       |V-1 MUIMERP NALBOC            |6.0000|30.820000|.0000|.00000000000000|6.0000|458114.67", 
"000005928484|Name Nmee Leonel                        |YUMBO               |El Placer de El Cerrito   ALG 76248     |114|80041725|20140424|4132638|20140425|P|PED.ELE/100098-114       |Corregimiento de amaime        ||90400105       |V-2 MUIMERP NALBOC            |3.0000|29.170000|.0000|.00000000000000|3.0000|169750.62", 
"000005928484|Name Nmee Leonel                        |YUMBO               |El Placer de El Cerrito   ALG 76248     |114|80041725|20140424|4132638|20140425|P|PED.ELE/100098-114       |Corregimiento de amaime        ||90700101       |V-OCIMONOCE LOREMIPSUM        |12.0000|5.980000|.0000|.00000000000000|12.0000|107118.18", 
"000815004980|Odrareg Oinotna Namzug S. En C.S.       |YUMBO               |Rozo (Palmira)            ALG 76520     |114|80041726|20140424|4132636|20140425|P|PED.ELE/100099-114       |Corregimiento de palmira"
)
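
A note on why none of the patterns above can match: `readLines` splits the file on line terminators and drops them, so no element of the resulting vector contains `\n` or `\r`. The following is a minimal sketch on a hypothetical three-line sample (the temporary file and its contents are made up for illustration, not taken from the real file):

```r
# Write a tiny sample containing an LF LF + spaces sequence, then read it back.
tmp <- tempfile(fileext = ".txt")
writeLines("a|b|c\n\n          ||d|e", tmp)   # hypothetical sample data
x <- readLines(tmp)
unlink(tmp)

length(x)                               # three elements; the middle one is ""
any(grepl("\n", x))                     # no element contains a newline
identical(gsub("\n {2,}", "", x), x)    # so gsub() returns x unchanged
```

The blank line in the file shows up as an empty string in the vector, which is why the fix has to operate on neighbouring elements rather than on characters inside a single element.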
PavoDive
  • This should get you closer: `prob3 <- gsub("[\r\n] {2,}", "", prob); prob4 <- prob3[!prob3 %in% ""]`. – JasonAizkalns May 18 '16 at 16:50
  • have you considered using something other than `readLines` (such as `read.delim`) that allows you to (1) skip blank lines, and (2) specify a within-line delimiter (not sure if you want to do that, but maybe `delim="|"` would be useful?) – drammock May 18 '16 at 17:22
  • @drammock thanks for your comment. At first I attempted to `data.table::fread` (since the file is big, it's a lot faster than `read.table` and cousins), but it complained because of different number of columns (caused by an unneeded LF, as shown in the example). In this specific case, that wouldn't be a solution. – PavoDive May 18 '16 at 22:26
  • @JasonAizkalns The last command in your comment produces a new vector without the empty elements. The first one, however, doesn't change the vector `prob` at all, so the need to remove `\n {2,}` remains... – PavoDive May 18 '16 at 22:48
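
One possible way to join the pieces after `readLines`, sketched here on a shortened, hypothetical stand-in for `prob`: drop the empty elements, treat every element that starts with two or more spaces as a continuation of the previous record, and paste each group together. The names `x`, `is_cont`, `grp` and `merged` are illustrative, and keeping the leading spaces of the continuation line at the join is an assumption; trim them first if a different amount of padding is wanted.

```r
# Toy stand-in for `prob` (hypothetical, shortened records)
x <- c("0001|Name A|amaime",
       "",
       paste0(strrep(" ", 10), "||903|6.0"),
       "0002|Name B|palmira")

x <- x[nzchar(x)]                 # drop the empty elements
is_cont <- grepl("^ {2,}", x)     # continuation lines start with 2+ spaces
grp <- cumsum(!is_cont)           # every non-continuation line opens a new group

# Paste the members of each group into a single string
merged <- unname(vapply(split(x, grp), paste, character(1), collapse = ""))
merged
```

A record without a continuation line, like the last one in the sample, forms a group of its own and is returned unchanged.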

0 Answers