I'm trying to extract the "Number" of "Humans" in the string below, for example:

string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")

The position of the text in the string will constantly change, so I need R to search the string and find "Species|Human|Number|" and return 1.

Apologies if this is a duplicate of another thread. I've looked here (extract a substring in R according to a pattern) and here (R extract part of string), but I'm not having any luck.

Any ideas?

Ross

2 Answers

Use a capturing approach - capture 1 or more digits (\d+) after the known substring (just escape the | symbols):

> string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
> pattern = "Species\\|Human\\|Number\\|(\\d+)"
> unlist(regmatches(string,regexec(pattern,string)))[2]
[1] "1"
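
If the input is a vector of strings and some of them may lack the key, the same capturing pattern can be wrapped in a small helper. This is only a sketch: the function name extract_human_number and the sample vector x are made up for illustration, and it assumes NA is the desired result when nothing matches.

```r
# Hypothetical helper (not part of the original answer).
# Returns the captured number as an integer, or NA when the key is absent.
extract_human_number <- function(x) {
  pattern <- "Species\\|Human\\|Number\\|(\\d+)"
  m <- regmatches(x, regexec(pattern, x))
  # Each element of m is c(full_match, group_1), or character(0) if no match.
  vapply(
    m,
    function(hit) if (length(hit) >= 2) as.integer(hit[2]) else NA_integer_,
    integer(1)
  )
}

# Illustrative input: the second string has no "Species|Human|Number|" entry.
x <- c("Species|Cat|Number|1, Species|Human|Number|12, Species|Human|Position|Left",
       "Species|Human|Gender|Female")
extract_human_number(x)
# [1] 12 NA
```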

A variation is to use a PCRE regex (perl=TRUE) with regmatches/regexpr:

> pattern="(?<=Species\\|Human\\|Number\\|)\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"

Here, the left-hand context is put inside a non-consuming pattern, a positive lookbehind, (?<=...), so it must be present but is not included in the returned match.

The same functionality can be achieved with the \K operator, which discards everything matched before it from the returned match:

> pattern="Species\\|Human\\|Number\\|\\K\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
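
regexpr only reports the first match. If a string might contain several "Species|Human|Number|" entries, swapping in gregexpr with the lookbehind pattern above should return all of them. A minimal sketch, with a made-up input string x:

```r
# Illustrative input with two Human entries and one Cat entry (not from the question).
x <- "Species|Human|Number|1, Species|Cat|Number|7, Species|Human|Number|3"

# Same lookbehind pattern as above; gregexpr finds every match, not just the first.
pattern <- "(?<=Species\\|Human\\|Number\\|)\\d+"
hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

as.integer(hits)
# [1] 1 3
```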
Wiktor Stribiżew

Simplest way I can think of:

as.integer(gsub("^.+Species\\|Human\\|Number\\|(\\d+).+$", "\\1", string))

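It will introduce NAs (with a coercion warning) where there is no mention of Species|Human|Number, because gsub returns the unmatched string unchanged and as.integer cannot convert it. For the same reason, there will be artefacts if any of the strings is itself just a number, since it will be returned as-is (but I assume that this won't be an issue).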
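
Both caveats can be sidestepped by testing for the pattern first and only substituting where it matched. The following is a sketch, not the answer's own code: the name extract_with_sub is hypothetical, and it uses .* instead of .+ on the assumption that the key may also sit at the very start or end of the string.

```r
# Hypothetical guarded variant of the substitution approach.
extract_with_sub <- function(x) {
  pattern <- "^.*Species\\|Human\\|Number\\|(\\d+).*$"
  has_key <- grepl(pattern, x)
  out <- rep(NA_integer_, length(x))
  # Substitute only in strings that actually contain the key.
  out[has_key] <- as.integer(sub(pattern, "\\1", x[has_key]))
  out
}

# The second element is a bare number and should not be mistaken for a match.
extract_with_sub(c("Species|Human|Number|4, Species|Human|Position|Left", "12345"))
# [1]  4 NA
```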

MrMobster