
I have a question about extracting part of a string from several files that have rows like these:

units = specified
name 0 = prDM: Pressure, Digiquartz [db]
name 1 = t090C: Temperature [ITS-90, deg C]
name 2 = c0S/m: Conductivity [S/m]
name 3 = t190C:Temperature, 2 [ITS-90, deg C]
name 4 = c1S/m: Conductivity, 2 [S/m]
name 5 = flSP: Fluorescence, Seapoint
name 6 = sbeox0ML/L: Oxygen, SBE 43 [ml/l]
name 7 = altM: Altimeter [m]
name 8 = sal00: Salinity, Practical [PSU]
name 9 = sal11: Salinity, Practical, 2 [PSU]
span 0 = 1.000, 42.000

I need to extract information only from the rows that start with "name", taking everything between = and : . For example, for the row "name 0 = prDM: Pressure, Digiquartz [db]" the desired result is prDM. The number of "name" rows differs between files (e.g. this example has 13 rows but other files have 16, and the number varies), so I want the solution to be as general as possible, so that I can always extract the right strings regardless of the number of rows. Each row starts with # and a space before name. I have tried this code, but it only extracts the first row. Can you please help me with this? Many thanks!

CNV<-NULL
for (i in 1:nro.files){
x <- readLines(all.files[i])
name.col<-grep("^\\# name", x) 
df <- data.table::fread(text = x[name.col])
CNV[[i]]<-df
}

3 Answers


Use stringr with the regex pattern "name \\d+ = (.*?):", which in words means: "name", a space, one or more digits, a space, an equals sign, a space, then a captured group containing any character (the period) zero or more times (the *), matched lazily (the ?) so that the group stops at the first colon that follows.

    library(stringr)
    strings <- c("name 0 = prDM: Pressure, Digiquartz [db]",
                 "name 1 = t090C: Temperature [ITS-90, deg C]",
                 "name 2 = c0S/m: Conductivity [S/m]",
                 "name 3 = t190C:Temperature, 2 [ITS-90, deg C]",
                 "name 4 = c1S/m: Conductivity, 2 [S/m]",
                 "name 5 = flSP: Fluorescence, Seapoint",
                 "name 6 = sbeox0ML/L: Oxygen, SBE 43 [ml/l]",
                 "name 7 = altM: Altimeter [m]",
                 "name 8 = sal00: Salinity, Practical [PSU]",
                 "name 9 = sal11: Salinity, Practical, 2 [PSU]")

    # Column 1 of the result is the full match, column 2 is the captured group
    result <- str_match(strings, "name \\d+ = (.*?):")
    result[,2]
 [1] "prDM"       "t090C"      "c0S/m"      "t190C"      "c1S/m"      "flSP"       "sbeox0ML/L"
 [8] "altM"       "sal00"      "sal11"

Or, if you prefer base R:

pattern <- "name \\d+ = (.*?):"
result <- regmatches(strings, regexec(pattern, strings))
# Element 1 of each match is the full match, element 2 is the captured group
sapply(result, "[[", 2)

 [1] "prDM"       "t090C"      "c0S/m"      "t190C"      "c1S/m"      "flSP"       "sbeox0ML/L"
 [8] "altM"       "sal00"      "sal11" 
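
Since the question is about lines read from several files, where each header row starts with "# " and other rows may also contain = and :, the base R approach could be wrapped in a small helper. This is only a sketch under those assumptions: extract_names is a hypothetical helper name, the temporary file stands in for the real .cnv files, and the pattern uses [0-9]+ and [^:]+ instead of \\d+ and .*? purely as a conservative choice.

```r
# Hypothetical helper: return the short names from the "# name N = xxx:" rows
extract_names <- function(lines) {
  pattern <- "^# name [0-9]+ = ([^:]+):"
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) > 0]          # drop rows that did not match
  vapply(hits, `[[`, character(1), 2)      # keep only the captured group
}

# Stand-in for one real file (assumed layout, based on the rows in the question)
tmp <- tempfile(fileext = ".cnv")
writeLines(c("# units = specified",
             "# name 0 = prDM: Pressure, Digiquartz [db]",
             "# name 1 = t090C: Temperature [ITS-90, deg C]",
             "# span 0 = 1.000, 42.000"), tmp)

all.files <- tmp
CNV <- lapply(all.files, function(f) extract_names(readLines(f)))
CNV[[1]]
# [1] "prDM"  "t090C"
```

Because the helper works on whatever rows match, it does not depend on how many "name" rows a given file has.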
Greg
  • That can be useful, thanks! But my problem is that I have other columns that contain = and :, and I only need the ones that have the string "name" followed by a number – Carla Berghoff Apr 21 '20 at 17:50
  • @CarlaBerghoff I'm not sure what your question is - please provide a reproducible example and your expected output per https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example If you just need to make sure the original string contains "name" you can add it to the regex `result <- str_match(strings, "name \\d = (.*):")` – Greg Apr 21 '20 at 18:20
  • here is one file as an example: https://www.dropbox.com/s/uq402dl22lingfo/CT01.cnv?dl=0 – Carla Berghoff Apr 21 '20 at 18:29
  • @CarlaBerghoff please read the first answer in the linked post in my comment above and produce an example like the post describes. – Greg Apr 21 '20 at 18:32
  • It does not work... I get the full first row that starts with "name" and get this warning: 1: In data.table::fread(text = x[result]) : Stopped early on line 3. Expected 2 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<# name 2 = c0S/m: Conductivity [S/m]>> – Carla Berghoff Apr 21 '20 at 19:45

Use str_extract from package stringr and positive lookahead and lookbehind:

library(stringr)

str <- "name 0 = prDM: Pressure, Digiquartz [db]"

str_extract(str, "(?<== ).*(?=:)")
[1] "prDM"

Explanation:

(?<== ) if you see = followed by white space on the left (lookbehind)

.* match anything until ...

(?=:)... you see a colon on the right (lookahead)
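
To restrict the match to rows that begin with "# name" followed by a number, the lookbehind would need to hold a prefix of variable length, which regex engines often do not allow. One possible workaround, sketched here in base R rather than stringr, is PCRE's \K, which discards everything matched so far from the reported match. The sample lines, including the "# start_time" row, are made up for illustration.

```r
lines <- c("# name 0 = prDM: Pressure, Digiquartz [db]",
           "# span 0 = 1.000, 42.000",
           "# start_time = Apr 21 2020 17:50:00",   # made-up row with = and :
           "# name 3 = t190C:Temperature, 2 [ITS-90, deg C]")

# Everything before \\K must match but is not returned;
# [^:]+ then takes the characters up to the first colon.
m <- regexpr("^# name \\d+ = \\K[^:]+", lines, perl = TRUE)

# regmatches() returns only the rows that matched
short_names <- regmatches(lines, m)
short_names
# [1] "prDM"  "t190C"
```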

Chris Ruehlemann
  • Since other columns contain = and :, and I only want to extract the string from the ones that start with name, could you help me adapt the code? I am so lost! Thanks! – Carla Berghoff Apr 21 '20 at 18:01
  • Can you add a few examples of these "other" strings? – Chris Ruehlemann Apr 21 '20 at 19:04
  • If I understand you correctly, you have other strings that also contain `=` and `:` but which do not start with `name` and from which you do not wish to extract anything, right? Something like this: `str <- c("name 0 = prDM: Pressure, Digiquartz [db]","blahblah X = blaBLah: blahblah usw.","name 1 = t090C: Temperature [ITS-90, deg C]")` – Chris Ruehlemann Apr 21 '20 at 19:10

In Base R

test <- c("name 0 = prDM: Pressure, Digiquartz [db]","name 1 = t090C: Temperature [ITS-90, deg C]")

gsub("^name [0-9]+ = (.+):.+","\\1",test)

[1] "prDM"  "t090C"

Explanation:

^name [0-9]+ anchors the match at the beginning of the string (^) and looks for name followed by a number of any length

= (.+): one or more (+) of any character (.) found between = and : are stored by the parentheses ( ) so they can be recalled later with \\1
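
Note that a substitution returns any string that does not match the pattern unchanged, so when the input also holds rows that do not start with name, it may help to filter first. A minimal sketch, assuming the rows carry the "# " prefix mentioned in the question; the "# interval" row is an invented example of a non-name row that contains = and :.

```r
lines <- c("# name 0 = prDM: Pressure, Digiquartz [db]",
           "# interval = seconds: 0.0416667",        # invented non-name row
           "# name 1 = t090C: Temperature [ITS-90, deg C]")

# Keep only the rows that start with "# name <number> = "
is_name <- grepl("^# name [0-9]+ = ", lines)

# Replace each kept row by the text between "= " and the first ":"
short <- sub("^# name [0-9]+ = ([^:]+):.*$", "\\1", lines[is_name])
short
# [1] "prDM"  "t090C"
```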

Daniel O