Regular expressions, stringr - have the regex, can't get it to work in R

Question

I have a data frame with a field called "full.path.name". It contains values like s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx

The 01 GROUP part is a top-level folder name whose length varies from row to row.

I would like to add a new field onto the data frame called "short.path" and it would contain things like

s:///01 GROUP

s:///02 GROUP LONGER NAME

I've managed to extract the last four characters of the file name using stringr, so I think I should use stringr again.

This gives me the file extension

sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))

I went to https://www.regextester.com/ and got this

 s:///*.[^/]*

as the regex to use so I tried it below

sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))

What I expected was a new field on my data frame containing 01 GROUP etc. Instead I get NA.

When I try this

sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")

Gives me S

Where am I going wrong? When I use: https://regexr.com/ I get \d* [A-Z]* [A-Z]*[^/]

How do I put that into

sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])

And make things work?

EDIT: There are two solutions here. The reason the solutions didn't work at first was that

  sfiles$Full.path.name

was longer than 255 characters in some cases.

What I did to make g_t_m's regex work:

 library(tidyverse)
 # read the file
 sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = FALSE)

 # add a field with the path length and filter out paths over 255 characters
 sfiles$file_path_length <- str_length(sfiles$Full.path.name)
 sfiles <- sfiles %>% filter(file_path_length <= 255)

 # then use str_replace to strip the rest of the path and leave only the
 # top folder names
 sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name,
                                                 "(^.+?/[^/]+?)/.+$", "\\1"))
 levels(sfiles$file_path_short)

[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7"

I think it was the full.path.name field that was causing problems. To make Wiktor's answer work I did this:

# read the file
sfiles <- read.csv("H:/sdrive_files.csv", stringsAsFactors = FALSE)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles <- sfiles %>% filter(file_path_length <= 255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name,
                                      "(^.+?/[^/]+?)/.+$", "\\1")

Try `sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")` or `"(?<=^s:///)[^/]+"` if `s:///` should not be returned. — Wiktor Stribiżew, Jan 24 '19 at 12:02
When I try `"^s:///[^/]+"` I get NA. When I try `"(?<=^s:///)[^/]+"` I also get NA. The full path name is stored as a character string. — damo, Jan 24 '19 at 13:48
Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Wiktor Stribiżew, Jan 24 '19 at 14:01
I think it had something to do with file name length. There were some very long strings in there, so I filtered out anything over 255 characters. I ran your regex and it worked! — damo, Jan 24 '19 at 14:24
I want to be able to credit both answers. How can I do that? Both have worked. The regex you gave me works as does the one below. What's the correct etiquette here? — damo, Jan 24 '19 at 15:34
I posted my answer below. An upvote would suffice if you are using the other solution. Just make sure you accept the answer with the solution that works best for you. Please make your accepting decision once. — Wiktor Stribiżew, Jan 24 '19 at 15:38

score 1 · Answer 1 · answered Jan 24 '19 at 12:24


Firstly, I would use a regex to extract the file extension instead of taking the last four characters, since file extensions are not always 4 characters long:

library(stringr)

df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
                                    "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"), stringsAsFactors = F)

df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")

df$file_type
[1] "docx" "pdf"

Then, the following code should give you your short name:

df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")

df
                                              full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx      docx   s:///01 GROUP
2  s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf       pdf   s:///01 GROUP
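
As a possible alternative for the extension step, here is a minimal sketch that does not go through `basename()`. It relies on `tools::file_ext()` from base R's tools package and on a stringr lookbehind. The sample paths, including the second and third ones, are hypothetical and only there to show multi-dot and no-extension cases.

```r
library(stringr)

# Hypothetical sample paths (assumed for illustration, not from the real data).
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "S:///02 GROUP LONGER NAME/notes.final.pdf",
           "s:///03 GROUP/README")

# tools::file_ext() returns the alphanumeric text after the last dot,
# or an empty string when there is no extension.
ext_base <- tools::file_ext(paths)

# stringr variant: a lookbehind for a dot, then alphanumerics up to the end.
# Returns NA (not "") when there is no extension.
ext_regex <- str_extract(paths, "(?<=\\.)[[:alnum:]]+$")

ext_base   # "docx" "pdf" ""
ext_regex  # "docx" "pdf" NA
```

Which of the two fits better probably depends on whether you would rather get an empty string or an `NA` for files without an extension.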

answered Jan 24 '19 at 12:24

g_t_m


Ok. I've reimported the csv used to generate the data frame and made sure I set `stringsAsFactors = F`. When I tried `df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")` I got `Error in basename(sfiles$Full.path.name) : path too long`. Clearly, there are some very long file names in the field. – damo Jan 24 '19 at 13:42
Actually. It works. I skipped the first suggestion of using a different regex, because the way people have been naming files is awful and there are things that made no sense. I'll work through that somehow. I did this `df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")` and it worked. It pulled out the seven folders. – damo Jan 24 '19 at 14:20
I'll mark this as a solution and put the solution in the question. – damo Jan 24 '19 at 14:20

score 1 · Accepted Answer · answered Jan 24 '19 at 15:36

You may use a mere

sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")

If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:

"(?<=^s:///)[^/]+"

See the regex demo

Details

^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
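
A short sketch of both patterns on sample data follows. The two paths are hypothetical. The second one starts with an upper-case `S` because the levels printed in the question's EDIT do, so treating case as a possible cause of `NA` is an assumption about the real data, not a confirmed diagnosis. The last call uses an illustrative pattern of my own to show that a regex copied from an online tester has to be quoted and have its backslashes doubled in an R string.

```r
library(stringr)

# Hypothetical sample paths; the second starts with an upper-case "S".
paths <- c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
           "S:///02 GROUP LONGER NAME/01 SUBGROUP/report.pdf")

# Case-sensitive match: the "S:///" path does not match and yields NA.
case_sensitive <- str_extract(paths, "^s:///[^/]+")
# "s:///01 GROUP" NA

# Same pattern wrapped in regex() with ignore_case = TRUE.
with_prefix <- str_extract(paths, regex("^s:///[^/]+", ignore_case = TRUE))
# "s:///01 GROUP" "S:///02 GROUP LONGER NAME"

# Lookbehind variant: the prefix is required but not returned.
without_prefix <- str_extract(paths,
                              regex("(?<=^s:///)[^/]+", ignore_case = TRUE))
# "01 GROUP" "02 GROUP LONGER NAME"

# Illustrative pattern (not from the question): \d must be written as \\d
# inside an R string, and the whole regex must be quoted.
digits_first <- str_extract(paths, "\\d+ [^/]+")
# "01 GROUP" "02 GROUP LONGER NAME"
```

If every path in the data is known to start with the same letter case, the plain string pattern is enough and `regex()` can be dropped.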

Thanks Wiktor. This also works, without the problems of the long path name. And explains regex patterns. I have a feeling I need to do more of this. — damo, Jan 24 '19 at 15:41
