stringr to extract a column of text

Question

I have a string that looks like this:

t2 <- "============================================
                       Model 1    Model 2   
--------------------------------------------
education               3.66 ***    2.80 ***
                       (0.65)      (0.59)   
income                  1.04 ***    0.85 ***
                       (0.26)      (0.23)   
type: blue collar      -5.91      -27.55 ***
                       (3.94)      (5.41)   
type: white collar     -8.82 **   -24.12 ***
                       (2.79)      (5.35)   
income x blue collar                3.01 ***
                                   (0.58)   
income x white collar               1.91 *  
                                   (0.81)   
prop. female            0.01        0.08 *  
                       (0.03)      (0.03)   
--------------------------------------------
R^2                     0.83        0.87    
Adj. R^2                0.83        0.86    
Num. obs.              98          98       
============================================
*** p < 0.001, ** p < 0.01, * p < 0.05"

and I'm trying to extract the left-hand column so that I get a vector that looks like this:

education
income
type: blue collar
type: white collar
income x blue collar
income x white collar
prop. female

I'm new to regex and stringr, and I'm trying to extract the words that follow a linebreak:

library(stringr)
covariates <- str_extract_all(t2, "\n\\w+")
covariates

which is getting me a bit closer:

[1] "\neducation" "\nincome"    "\ntype"      "\ntype"      "\nincome"    "\nincome"    "\nprop"      "\nR"        
 [9] "\nAdj"       "\nNum"

but I can't work out how to capture the entire column of text, e.g., getting the full "type: blue collar" instead of "\ntype".

@WiktorStribiżew Yes, exactly, thank you. Sorry I wasn't clearer. — Jeremy K., Aug 22 '19 at 09:37

Wiktor Stribiżew · Accepted Answer · 2019-08-22T09:57:50.103


You may use

covariates <- str_extract_all(
        str_match(t2, "(?ms)^-{3,}\n(.*?)\n-{3,}$")[,2], 
        "(?m)^\\S.*?(?=\\h{2})"
)

Or, to make it work much faster, use these unrolled patterns:

covariates <- str_extract_all(
        str_match(t2, "(?m)^-{3,}\n(.*(?:\n(?!-{3,}$).*)*)\n-{3,}$")[,2],
        "(?m)^\\S\\H*(?:\\h(?!\\h)\\H*)*"
)
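
As a quick sanity check that the two pairs of patterns select the same text, here is a minimal sketch that runs both on a small made-up table. The `tbl` string and the `extract_labels()` helper are illustrative assumptions, not part of the question, and speed is not measured here.

```r
library(stringr)

# Small made-up table in the same layout as t2 (illustrative data only)
tbl <- paste(c(
  "==============================",
  "                     Model 1",
  "------------------------------",
  "education             3.66 ***",
  "                     (0.65)",
  "type: blue collar    -5.91",
  "                     (3.94)",
  "------------------------------",
  "R^2                   0.83",
  "=============================="
), collapse = "\n")

# Hypothetical helper: cut out the block between the hyphen rules,
# then pull the row labels out of that block
extract_labels <- function(x, block_pattern, label_pattern) {
  body <- str_match(x, block_pattern)[, 2]
  str_extract_all(body, label_pattern)[[1]]
}

# Lazy patterns from the first snippet
lazy <- extract_labels(
  tbl,
  "(?ms)^-{3,}\n(.*?)\n-{3,}$",
  "(?m)^\\S.*?(?=\\h{2})"
)

# Unrolled patterns from the second snippet
unrolled <- extract_labels(
  tbl,
  "(?m)^-{3,}\n(.*(?:\n(?!-{3,}$).*)*)\n-{3,}$",
  "(?m)^\\S\\H*(?:\\h(?!\\h)\\H*)*"
)

print(lazy)                # "education" "type: blue collar"
identical(lazy, unrolled)  # TRUE on this sample
```

Note that `\\H` also matches line breaks, so the unrolled label pattern relies on every label being followed by at least two spaces or tabs on its own line, as is the case in this table layout.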

With str_match(t2, "(?ms)^-{3,}\n(.*?)\n-{3,}$")[,2], you extract all text between two lines that are made of 3 or more hyphens. Here are the pattern details:

(?ms) - multiline (making ^ match start of a line and $ match end of line) and singleline/dotall (making . match line breaks, too) modes on
^ - start of a line
-{3,} - three or more hyphens
\n - a newline
(.*?) - Group 1: any 0+ chars but as few as possible
\n - a newline
-{3,} - three or more hyphens
$ - end of line.
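
To see what the dotall flag contributes, the following sketch runs the block pattern on a tiny made-up string, once with `(?ms)` and once with `(?m)` only. The input `x` is an assumption chosen for illustration.

```r
library(stringr)

# Tiny made-up input: a two-line body between two hyphen rules
x <- "header\n-----\nrow one\nrow two\n-----\nfooter"

# With (?ms): column 1 is the whole match, column 2 is Group 1
m <- str_match(x, "(?ms)^-{3,}\n(.*?)\n-{3,}$")
dim(m)   # 1 row, 2 columns
m[, 2]   # "row one\nrow two"

# With (?m) only: . cannot cross the line break inside the body,
# so nothing matches and str_match returns NA
m_no_dotall <- str_match(x, "(?m)^-{3,}\n(.*?)\n-{3,}$")
is.na(m_no_dotall[, 2])  # TRUE
```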

The (?m)^\\S.*?(?=\\h{2}) pattern is then applied to that extracted part of the string and matches:

(?m) - multiline mode on
^ - start of a line
\\S - a non-whitespace char
.*? - any 0+ chars other than line break chars, as few as possible
(?=\\h{2}) - immediately to the right of the current location, there must be 2 horizontal whitespace chars (they are checked but not consumed, so they are not part of the match).
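
The label pattern can be tried on its own as well. This is a minimal sketch on a made-up table body (the `body` string stands in for what the first step returns). Since `str_extract_all` returns a list, `[[1]]` is used to get the plain character vector the question asks for.

```r
library(stringr)

# Made-up table body, lines joined by "\n" (illustrative data only)
body <- paste(c(
  "education             3.66 ***",
  "                     (0.65)",
  "income x blue collar  3.01 ***",
  "                     (0.58)"
), collapse = "\n")

# One list element per input string; lines that start with
# whitespace (the standard-error rows) never match ^\S
labels <- str_extract_all(body, "(?m)^\\S.*?(?=\\h{2})")

# Pull out the character vector for the single input string
labels[[1]]  # "education" "income x blue collar"
```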

edited Aug 22 '19 at 09:57

answered Aug 22 '19 at 09:44

Wiktor Stribiżew


The edit where you explain how each part of the regex works is invaluable for people learning like me. Do you recommend a particular resource for learning regex? – Jeremy K. Aug 22 '19 at 09:48

@JeremyK. `stringr` uses the ICU regex library, and there are quite few resources for it. Use the [ICU regex docs](http://userguide.icu-project.org/strings/regexp) as your primary reference. ICU is close to Java regex, so a lot will look similar. – Wiktor Stribiżew Aug 22 '19 at 09:52

@JeremyK. I added a much better alternative without explanation, because these patterns match exactly the same strings as those above, but are faster. If you need to know more details, please read about [unrolling the loop in regex](http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop). – Wiktor Stribiżew Aug 22 '19 at 10:00
Can I please ask one more question? Why does `str_match(t2, "(?ms)^-{3,}\n(.*?)\n-{3,}$")` return two different matches (meaning that we then need to do `str_match(t2, "(?ms)^-{3,}\n(.*?)\n-{3,}$")[,2]` to get just the relevant one)? If we search for `^-{3,}`, I would have thought only the [,1] would be returned, as the [,2] doesn't even begin with `---`. Thank you again for your invaluable help. I've been reading up and working through examples on this. – Jeremy K. Aug 23 '19 at 00:07

@JeremyK. `str_match` searches for a single regex match occurrence in the input string. Each match may contain *captured* substrings, those matched with parenthesized regex parts (unlike `str_extract` that drops the captures). As there is one, `(.*?)`, the output contains a matrix with two columns and as we need the second one, I used `[,2]`. You may replace the `str_match` with 2 `sub` calls: `sub("\n-{3,}(?:\n.*)?$", "", sub("^.*?\n-{3,}\n", "", t2))` – Wiktor Stribiżew Aug 23 '19 at 06:46
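
The difference between the two functions is easiest to see on a tiny example. The input string and pattern below are made up for illustration.

```r
library(stringr)

x <- "id=42"             # made-up input
pattern <- "id=(\\d+)"   # one capturing group

# str_extract keeps only the whole match
whole <- str_extract(x, pattern)

# str_match returns a matrix: whole match first, then one column per group
m <- str_match(x, pattern)

whole    # "id=42"
m[, 1]   # "id=42"
m[, 2]   # "42"
```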
