R Extract partially matching string

Question

I have a question about extracting part of a string from several files that have rows like these:

units = specified - name 0 = prDM: Pressure, Digiquartz [db] - name 1 = t090C: Temperature [ITS-90, deg C] - name 2 = c0S/m: Conductivity [S/m] - name 3 = t190C:Temperature, 2 [ITS-90, deg C] - name 4 = c1S/m: Conductivity, 2 [S/m] - name 5 = flSP: Fluorescence, Seapoint - name 6 = sbeox0ML/L: Oxygen, SBE 43 [ml/l] - name 7 = altM: Altimeter [m] - name 8 = sal00: Salinity, Practical [PSU] - name 9 = sal11: Salinity, Practical, 2 [PSU] - span 0 = 1.000, 42.000

I need to extract information only from the rows that start with "name", taking everything between = and : . For example, for the row "name 0 = prDM: Pressure, Digiquartz [db]" the desired result is prDM. Files have different numbers of "name" rows (this example has 13 rows, other files have 16, and the number varies), so I want the solution to be general enough that I can always extract the right strings regardless of the number of rows. Each row starts with # and a space before name. I have tried this code, but it only extracts the first row. Can you please help me with this? Many thanks!

CNV<-NULL
for (i in 1:nro.files){
x <- readLines(all.files[i])
name.col<-grep("^\\# name", x) 
df <- data.table::fread(text = x[name.col])
CNV[[i]]<-df
}

Try [`name[^=]+=\s([^:]+)`](https://regex101.com/r/ED2jSB/1) — Srdjan M., Apr 21 '20 at 17:32
I get this Error: '\s' is an unrecognized escape in character string starting "" name[^=]+=\s" — Carla Berghoff, Apr 21 '20 at 17:40
Your string in file is a single line? All these rows are one after another? — Srdjan M., Apr 21 '20 at 18:02
I've updated my answer as to your comment about only wanting the `name =` strings taken. — Daniel O, Apr 21 '20 at 18:04
here is one file as an example: https://www.dropbox.com/s/uq402dl22lingfo/CT01.cnv?dl=0 — Carla Berghoff, Apr 21 '20 at 18:30
After you extract all matches with `str_match_all`, you need another `for` loop in which you read the data from your table: `data.table::fread(text = x[result[[1]][[i,2]]])` [demo](https://rextester.com/YPVYMO85771) – Srdjan M., Apr 21 '20 at 20:07
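The `'\s' is an unrecognized escape` error in the comments above comes from R's string parser, not from the regex engine: inside an R string literal a backslash has to be doubled, so `\s` is written `"\\s"`. Below is a minimal base R sketch of the suggested pattern with that fix applied. The shortened `header` string is an assumption made for illustration, not the real file content.

```r
# Pattern from the comment, with the backslash doubled for an R string literal.
pattern <- "name[^=]+=\\s([^:]+)"

# Shortened single-line header, assumed here only for illustration.
header <- paste(
  "units = specified",
  "name 0 = prDM: Pressure, Digiquartz [db]",
  "name 1 = t090C: Temperature [ITS-90, deg C]",
  "span 0 = 1.000, 42.000",
  sep = " - "
)

# gregexpr() locates every match; regmatches() returns the matched substrings.
hits <- regmatches(header, gregexpr(pattern, header))[[1]]

# Drop the leading "name <n> = " so only the captured part remains.
codes <- sub("^name[^=]+=\\s", "", hits)
print(codes)
```

Because `gregexpr()` returns whole matches rather than capture groups, the prefix is removed in a second step with `sub()`.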

Greg · Accepted Answer · 2020-04-21T20:27:51.793

0

Using stringr and the regex pattern "name \\d+ = (.*?):", which in words means: "name", followed by a space, one or more digits, a space, an equals sign and a space, followed by a captured group containing any character (the period) zero or more times (the *) but as few as possible (the ?), followed by a colon.

    library(stringr)
    strings <- c("name 0 = prDM: Pressure, Digiquartz [db]",
                 "name 1 = t090C: Temperature [ITS-90, deg C]",
                 "name 2 = c0S/m: Conductivity [S/m]",
                 "name 3 = t190C:Temperature, 2 [ITS-90, deg C]",
                 "name 4 = c1S/m: Conductivity, 2 [S/m]",
                 "name 5 = flSP: Fluorescence, Seapoint",
                 "name 6 = sbeox0ML/L: Oxygen, SBE 43 [ml/l]",
                 "name 7 = altM: Altimeter [m]",
                 "name 8 = sal00: Salinity, Practical [PSU]",
                 "name 9 = sal11: Salinity, Practical, 2 [PSU]")

    # Column 1 of the result holds the full match, column 2 the captured group.
    # The lazy (.*?) stops at the first colon, matching the description above.
    result <- str_match(strings, "name \\d+ = (.*?):")
    result[,2]
     [1] "prDM"       "t090C"      "c0S/m"      "t190C"      "c1S/m"      "flSP"       "sbeox0ML/L"
     [8] "altM"       "sal00"      "sal11"

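Or, if you prefer base R:

    pattern <- "name \\d+ = (.*?):"
    # regexec() returns the full match plus each captured group per string
    result <- regmatches(strings, regexec(pattern, strings))
    # the second element of each match is the captured group
    sapply(result, "[[", 2)

     [1] "prDM"       "t090C"      "c0S/m"      "t190C"      "c1S/m"      "flSP"       "sbeox0ML/L"
     [8] "altM"       "sal00"      "sal11"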
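For lines as they come out of `readLines()`, where each header row starts with `# ` and other rows may also contain `=` and `:`, the same capture-group idea can be anchored at `^# name`. The following is only a sketch: the helper name `extract_name_codes` and the sample lines (including the `start_time` row) are hypothetical, made up to stand in for a real file.

```r
# Hypothetical helper: return the code between "= " and ":" for every
# line that starts with "# name <digits> = "; all other lines are ignored.
extract_name_codes <- function(lines) {
  pattern <- "^# name [0-9]+ = ([^:]+):"
  matches <- regmatches(lines, regexec(pattern, lines))
  # Non-matching lines yield character(0); keep only full match + group.
  unlist(lapply(matches, function(hit) if (length(hit) == 2) hit[2]))
}

# Made-up lines standing in for readLines(all.files[i]).
x <- c(
  "# units = specified",
  "# name 0 = prDM: Pressure, Digiquartz [db]",
  "# name 1 = t090C: Temperature [ITS-90, deg C]",
  "# span 0 = 1.000, 42.000",
  "# start_time = Apr 21 2020 17:32:00"  # hypothetical row with = and :
)

codes <- extract_name_codes(x)
print(codes)

# Applied to several files (not run here, the paths are the asker's):
# CNV <- lapply(all.files, function(f) extract_name_codes(readLines(f)))
```

Because the pattern requires `# name` followed by digits, the number of "name" rows in a file does not matter, and rows such as `span` or `units` are skipped.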

edited Apr 21 '20 at 20:27

answered Apr 21 '20 at 17:33

Greg


that can be useful, thanks! but my problem is that I have other columns that have = and : but I only need the ones that have the string "name" followed by a number – Carla Berghoff Apr 21 '20 at 17:50
@CarlaBerghoff I'm not sure what your question is - please provide a reproducible example and your expected output per https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example If you just need to make sure the original string contains "name" you can add it to the regex `result <- str_match(strings, "name \\d = (.*):")` – Greg Apr 21 '20 at 18:20
here is one file as an example: https://www.dropbox.com/s/uq402dl22lingfo/CT01.cnv?dl=0 – Carla Berghoff Apr 21 '20 at 18:29
@CarlaBerghoff please read the first answer in the linked post in my comment above and produce an example like the post describes. – Greg Apr 21 '20 at 18:32
it does not work... I get the full first row that starts with "name" and get this warning: 1: In data.table::fread(text = x[result]) : Stopped early on line 3. Expected 2 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<# name 2 = c0S/m: Conductivity [S/m]>> – Carla Berghoff Apr 21 '20 at 19:45

Chris Ruehlemann · Answer 2 · 2020-04-21T17:45:16.790

0

Use str_extract from package stringr and positive lookahead and lookbehind:

library(stringr)

str <- "name 0 = prDM: Pressure, Digiquartz [db]"

# extract whatever sits between "= " and the colon
str_extract(str, "(?<== ).*(?=:)")
[1] "prDM"

Explanation:

(?<== ) if you see = followed by white space on the left (lookbehind)

.* match anything until ...

(?=:) ... you see a colon on the right (lookahead)
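If other strings also contain `=` and `:`, one option is to keep only the strings that start with `name` before applying the lookarounds. The sketch below uses base R instead of stringr, where lookarounds need `perl = TRUE`. The example vector is the one proposed in the comments on this answer, and `[^:]+` replaces `.*` so the match stops at the first colon.

```r
# Example strings taken from the comment thread; the middle one should be skipped.
str <- c("name 0 = prDM: Pressure, Digiquartz [db]",
         "blahblah X = blaBLah: blahblah usw.",
         "name 1 = t090C: Temperature [ITS-90, deg C]")

# Keep only strings that start with "name", a number and " = ".
name_rows <- str[grepl("^name [0-9]+ = ", str)]

# Lookbehind for "= " and lookahead for ":"; both require the PCRE engine.
codes <- regmatches(name_rows,
                    regexpr("(?<== )[^:]+(?=:)", name_rows, perl = TRUE))
print(codes)
```

Filtering first keeps the lookbehind short and fixed-length, which PCRE requires.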

edited Apr 21 '20 at 17:45

answered Apr 21 '20 at 17:37

Chris Ruehlemann


since other columns contain = and :, and I only want to extract the string in the ones that start with name, could you help me to adapt the code? I am so lost! Thanks! – Carla Berghoff Apr 21 '20 at 18:01
Can you add a few examples of these "other" strings? – Chris Ruehlemann Apr 21 '20 at 19:04
If I understand you correctly you have other strings that also contain `=` and `:` but which do not start with `name` and from which you do not wish to extract anything, right? Something like this: `str <- c("name 0 = prDM: Pressure, Digiquartz [db]","blahblah X = blaBLah: blahblah usw.","name 1 = t090C: Temperature [ITS-90, deg C]")` – Chris Ruehlemann Apr 21 '20 at 19:10

Daniel O · Answer 3 · 2020-04-21T18:09:55.090

0

In Base R

test <- c("name 0 = prDM: Pressure, Digiquartz [db]","name 1 = t090C: Temperature [ITS-90, deg C]")

gsub("^name [0-9]+ = (.+):.+","\\1",test)

[1] "prDM"  "t090C"

explanation

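^name [0-9]+ matches name at the beginning of the string (^), followed by a space and a number of any length

= (.+): one or more (+) of any character (.) found between = and : are captured by ( ) so they can be recalled later with \\1

.+ matches the rest of the string, so the whole string is replaced by the captured part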
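One thing to keep in mind with this approach: `sub()` and `gsub()` return a string unchanged when the pattern does not match, so rows without `name` would pass through as they are. A possible way around that, sketched here with made-up lines that carry the `# ` prefix described in the question, is to test with `grepl()` first.

```r
# Made-up lines in the "# ..." layout described in the question.
test <- c("# name 0 = prDM: Pressure, Digiquartz [db]",
          "# span 0 = 1.000, 42.000",
          "# name 1 = t090C: Temperature [ITS-90, deg C]")

pattern <- "^# name [0-9]+ = ([^:]+):.*$"

# Without a guard, the non-matching "span" row is returned unchanged.
unguarded <- sub(pattern, "\\1", test)

# Guarded: only rows that match the pattern are rewritten and kept.
codes <- sub(pattern, "\\1", test[grepl(pattern, test)])
print(codes)
```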

edited Apr 21 '20 at 18:09

answered Apr 21 '20 at 17:38

Daniel O

