
I have a set of strings that are file names. I want to extract all characters after the # symbol but before the file extension. For example, one of the file names is:

HelloWorld#you.txt

I would want to return the string you

Here is my code:

    hashPos = grep("#", name, fixed=TRUE)
    dotPos = length(name)-3
    finalText = substring(name, hashPos, dotPos)

I read online that grep is supposed to return the index where the first parameter occurs (in this case the # symbol). So, I was expecting the above to work but it does not.

Or how would I use a regular expression to extract this string? Also, what happens when the string does not have a # symbol? Would the function return a special value such as -1?

CodeKingPlusPlus

6 Answers


Here is a one-liner solution

gsub(".*\\#(.*)\\..*", "\\1", c("HelloWorld#you.txt"))

Output:

you

To explain the code: the pattern matches everything up to the #, captures every character after it up to the last ., and replaces the whole string with that captured group, so the output is the in-between string you are looking for.

Edit:

The above solution matches up to the last ., so it allows the file name to contain multiple dots. If you want to extract the name only up to the first ., you can use the regex .*\\#(\\w*)\\..* instead.
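To make the difference between the two patterns concrete, here is a minimal sketch in base R. The file names are made-up examples, and the # is written unescaped, which R's default regex engine also accepts. It also shows what happens when there is no #: sub leaves the input unchanged, so a separate grepl check is one way to flag those cases.

```r
# Made-up example names: one dot, several dots, and no hash at all
files <- c("HelloWorld#you.txt", "HelloWorld#you.and.me.txt", "HelloWorldyou.txt")

# Greedy (.*) captures everything between the # and the LAST dot
up_to_last_dot <- sub(".*#(.*)\\..*", "\\1", files)

# (\\w*) stops at the first non-word character, i.e. the FIRST dot
up_to_first_dot <- sub(".*#(\\w*)\\..*", "\\1", files)

# sub() returns its input untouched when the pattern does not match,
# so detect missing hashes explicitly and mark them as NA
has_hash <- grepl("#", files, fixed = TRUE)
result <- ifelse(has_hash, up_to_last_dot, NA_character_)

print(up_to_last_dot)
print(up_to_first_dot)
print(result)
```

Returning NA for names without a # is only one possible convention; an empty string would work equally well if that suits the calling code better.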

iTech
  • If a reader is still confused, they can check the table at the bottom of this page: http://www.endmemo.com/program/R/gsub.php. That helped me a lot. – Ehsan88 May 12 '15 at 11:51
  • The endmemo post was very helpful. Also, I thought @Chinmay Patil's answer below superior in that it handles multiple ".". – Stan Feb 08 '18 at 14:42

strapplyc: To extract the word immediately after #, try strapplyc from the gsubfn package:

> library(gsubfn)
>
> strapplyc("HelloWorld#you.txt", "#(\\w+)")[[1]]
[1] "you"

or this which allows the file name to contain dots:

> strapplyc("HelloWorld#you.txt", "#(.*)\\.")[[1]]
[1] "you"

file_path_sans_ext: Another, more filename-oriented approach uses the tools package (which comes bundled with R, so no extra packages need be installed):

> library(tools)
>
> file_path_sans_ext(sub(".*#", "", "HelloWorld#you.txt")) 
[1] "you"

ADDED: additional solutions

G. Grothendieck

You can use gsub. The advantage is that the captured text may itself contain dots, because the match runs up to the last one.

> s <- 'HelloWorld#you.and.me.txt'
> gsub('.*#(.*)\\.+.*','\\1', s)
[1] "you.and.me"
CHP

grep returns the index in terms of item numbers, not character placement (HelloWorld#you.txt has only one item, so it should return 1).

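You want regexpr instead; it counts characters rather than items.

# +1 skips the # itself; regexpr gives -1 when there is no match
hashPos = regexpr("#", name, fixed=TRUE) + 1
# nchar counts characters (length counts elements); -4 drops ".txt"
dotPos = nchar(name) - 4
finalText = substring(name, hashPos, dotPos)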
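As a rough illustration of the position-based approach, the sketch below contrasts grep with regexpr and wraps the extraction in a helper. The name extract_after_hash is hypothetical, and returning NA when there is no # is an assumption about the desired behaviour, not something the question specifies. It uses file_path_sans_ext from the bundled tools package so the extension does not have to be exactly three characters.

```r
name <- c("HelloWorld#you.txt", "HelloWorldyou.txt")

# grep reports WHICH elements match (here only the first one)
grep("#", name, fixed = TRUE)

# regexpr reports the character position within each element, -1 if absent
as.integer(regexpr("#", name, fixed = TRUE))

# Hypothetical helper: text after the first # and before the extension,
# or NA when the name contains no # (an assumed convention)
extract_after_hash <- function(name) {
  hashPos <- as.integer(regexpr("#", name, fixed = TRUE))
  stem <- tools::file_path_sans_ext(name)  # drop the trailing extension
  ifelse(hashPos > 0, substring(stem, hashPos + 1), NA_character_)
}

extract_after_hash(name)
```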
Señor O

This solution is easy for those not wanting to learn regex but doesn't align with the poster's intent (more for future searchers). This approach covers the case when you have no # as the function will return character(0).

library(qdap)
x <- c("HelloWorld#you.txt", "HelloWorldyou.txt")
genXtract(x, "#", ".")

Yields:

> genXtract(x, "#", ".")
$`#  :  right1`
[1] "you"

$`#  :  right2`
character(0)

I think there's a bug in the labels, though the actual return values are correct.

EDIT: This is indeed a bug that has been fixed in the development version. Output with devel. ver.:

> genXtract(x, "#", ".")
$`#  :  .1`
[1] "you"

$`#  :  .2`
character(0)
Tyler Rinker

I didn't like most of the solutions here so far. They either use overly complicated regexps or additional packages, which is unnecessary IMHO. I think this is much more to the point and more reusable:

# Function that finds a match and returns the matched string
getMatch = function(rexp, str) regmatches(str, regexpr(rexp, str))

filename = "HelloWorld#you.txt"

# The regexp here is simply the hash sign plus everything 
# following that is not a dot
getMatch("#[^.]*", filename)

Returns #you as it should (you can remove the # with the substr function). If the filename does not contain a hash sign, a zero-length character vector (character(0)) is returned, because regmatches drops non-matching elements.
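A short sketch of the follow-up steps, using only base R: stripping the leading # from the match, checking the no-match case, and, as an alternative that is not part of the original answer, a Perl-style lookbehind that leaves the # out of the match in the first place.

```r
getMatch <- function(rexp, str) regmatches(str, regexpr(rexp, str))

# Match includes the hash sign; drop it by taking characters from position 2
with_hash <- getMatch("#[^.]*", "HelloWorld#you.txt")
print(substring(with_hash, 2))

# No hash sign: regmatches drops the non-match, leaving a zero-length vector
no_hash <- getMatch("#[^.]*", "HelloWorldyou.txt")
print(length(no_hash))

# Alternative (an assumption, not the answer's method): a lookbehind
# requires a preceding # without making it part of the match
x <- "HelloWorld#you.txt"
lookbehind <- regmatches(x, regexpr("(?<=#)[^.]*", x, perl = TRUE))
print(lookbehind)
```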

Elmar Zander