
I have a text column with thousands of rows of paragraphs, and I want to extract values of the form "Capacity > x%". The operator can be >, <, =, ~, etc. I basically need the operator and the integer value (e.g. <40%) placed in a new column next to the text, in the same row. I have tried removing the text before/after, gsub, grep, grepl, str_extract, etc., none with good results. I am not sure whether the percentage sign is throwing it off or I am just not getting the code structure right. I would appreciate your assistance. Here is some of the code I have tried (aa is the data frame, TEXT is the column name):

str_extract(string =aa$TEXT, pattern = perl("(?<=LVEF).*(?=%)"))

gsub(".*[Capacity]([^.]+)[%].*", "\\1", aa$TEXT)

genXtract(aa$TEXT, "Capacity", "%")

gsub("%.*$", "%", aa$TEXT)

grep("^Capacity.*%$",aa$TEXT)
Enrico Cortinovis
Shawn
  • Can you edit your question and provide a small [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your dataset ? – dc37 Nov 30 '19 at 05:33
  • Is the character vector always the same number of characters? IF so, you may be able to use `substr()` – Dylan_Gomes Nov 30 '19 at 06:35

2 Answers


Since you did not provide a reproducible example, I created one myself and used it here.

We can use sub to extract everything after "Capacity" up to and including a number followed by a % sign.

sub(".*Capacity(.*\\d+%).*", "\\1", aa$TEXT)
#[1] " > 10%"  " < 40%"  " ~ 230%"

Or with str_extract

stringr::str_extract(aa$TEXT, "(?<=Capacity).*\\d+%")
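If stringr is not available, a base R variant along the same lines might look like the sketch below. Note that sub returns its input unchanged when the pattern does not match, so the sketch guards with grepl and fills in NA instead. The sample rows, the operator set `[<>=~]`, and the `capacity` column name are assumptions for illustration, not taken from the asker's real data.

```r
# Hypothetical sample data; the last row has no "Capacity ... %" phrase.
aa <- data.frame(
  TEXT = c("This is a temp text, Capacity > 10%",
           "Capacity = 40% in this row",
           "Capacity ~ 230% more text  ahead",
           "No matching phrase here"),
  stringsAsFactors = FALSE
)

# Assumed shape: "Capacity", optional spaces, one operator,
# optional spaces, digits, optional spaces, "%".
pattern <- ".*Capacity\\s*([<>=~]\\s*\\d+\\s*%).*"

# Keep the captured group where the pattern matches, NA elsewhere.
aa$capacity <- ifelse(grepl(pattern, aa$TEXT, perl = TRUE),
                      sub(pattern, "\\1", aa$TEXT, perl = TRUE),
                      NA_character_)
aa$capacity
```

Listing the operators explicitly keeps the match from swallowing unrelated text between "Capacity" and the number; extend the character class if other operators appear in the data.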

data

aa <- data.frame(TEXT = c("This is a temp text, Capacity > 10%", 
                    "This is a temp text, Capacity < 40%", 
                    "Capacity ~ 230% more text  ahead"), stringsAsFactors = FALSE)
Ronak Shah
  • Thank you, as your sub code captured most of what I needed, but missed it if there was an = sign vs < or >. I like the format where it enters a NA if the condition is not met. Appreciate it. – Shawn Nov 30 '19 at 20:51
  • @Shawn The solution I posted above is based on my understanding of your data which may or may not be true and that is the reason why you should share a reproducible example which represents your actual data so such edge cases are not missed. – Ronak Shah Dec 01 '19 at 02:31
  • Oh, I am fully aware that, because I did not submit an example, some nuances are definitely lost. I do appreciate your assistance and input, and I tried your suggested code to see the pros and cons. Honestly, I have reviewed the instructions on providing sample data/examples several times and they don't seem to work for my examples. I would appreciate any suggestions you may offer offline, if you have time. Thank you again for your support. – Shawn Dec 01 '19 at 18:09

gsub solution

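I think your gsub attempt was pretty close. Two things went wrong: the percentage sign sits outside the capture group (the parentheses), so it is not kept, and `[Capacity]` is a character class that matches any single one of those letters rather than the word itself. Something like this should work (the result is assigned to a new capacity column):

aa$capacity <- gsub(".*Capacity([^.]+%).*", "\\1", aa$TEXT)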

Alternative method

The gsub approach returns the whole string unchanged when the pattern does not match. To avoid this, we can use the stringr package with a more specific regular expression:

library(magrittr)
library(dplyr)
library(stringr)

aa %>% 
  mutate(capacity = str_extract(TEXT, "(?<=Capacity\\s)\\W\\s?\\d+\\s?%")) %>%
  mutate(capacity = str_squish(capacity)) # Remove excess white space

This code will give NA when there is no match, which I believe is your desired behaviour.
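Since the question asks for both the operator and the integer value, here is a minimal base R sketch that splits them into two columns and leaves NA where nothing matches. The sample rows, the operator set `[<>=~]`, the helper `pick`, and the column names `operator` and `value` are all assumptions for illustration.

```r
# Hypothetical sample data; the last row has no "Capacity ... %" phrase.
aa <- data.frame(
  TEXT = c("This is a temp text, Capacity > 10%",
           "Capacity = 40% in this row",
           "Capacity ~ 230% more text  ahead",
           "No matching phrase here"),
  stringsAsFactors = FALSE
)

# Two capture groups: the operator and the digits.
pattern <- "Capacity\\s*([<>=~])\\s*(\\d+)\\s*%"

# regexec() reports the full match plus each capture group;
# regmatches() gives character(0) for rows without a match.
m <- regmatches(aa$TEXT, regexec(pattern, aa$TEXT, perl = TRUE))

# Hypothetical helper: i-th element of a match vector, or NA if absent.
pick <- function(x, i) if (length(x) >= i) x[i] else NA_character_

aa$operator <- vapply(m, pick, character(1), i = 2, USE.NAMES = FALSE)
aa$value    <- as.integer(vapply(m, pick, character(1), i = 3, USE.NAMES = FALSE))
aa[, c("operator", "value")]
```

Keeping the value as an integer column makes later filtering (for example, rows with a value below 40) straightforward.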

Callum Savage
  • Your gsub code captured what I needed. Is there a way to place NA in case the condition is not met, instead of returning the entire text in the cell, please? also how can I place the results in a new column next to where this text is located? Appreciate your assistance very much. – Shawn Nov 30 '19 at 20:52
  • Edited my answer to reassign the result to the data frame. – Callum Savage Dec 01 '19 at 08:01
  • Terrific! Thank you again for your assistance. – Shawn Dec 01 '19 at 18:40