
I found a very strange behavior in strsplit(). It's similar to this question, however I would love to know why it is returning an empty element in the first place. Does someone know?

unlist(strsplit("88F5T7F4T13F", "\\d+"))  
[1] ""  "F" "T" "F" "T" "F"

Since I use that string for reproducing a long logical vector (88*FALSE 5*TRUE 7*FALSE 4*TRUE 13*FALSE), I have to be able to trust it...

Using unlist(strsplit("88F5T7F4T13F", "\\d+"))[-1] works, but is it robust?

Jeremy
  • This is expected behavior and is explained in the documentation. – Rich Scriven Feb 27 '17 at 00:37
  • The empty element appears since there are digits at the start. Since you split at digits, the first split occurs right between start of string and the first `F` and that empty string at the string start is added to the resulting list. You may use `unlist(strsplit(sub("^\\d+", "", "88F5T7F4T13F"), "\\d+"))` or your solution. – Wiktor Stribiżew Feb 27 '17 at 00:39
  • In order to remove the empty elements in a more systematic way, you can also use num_split = unlist(strsplit("88F5T7F4T13F", "\\d+")); num_split = num_split[num_split != ""] – Andy McKenzie Feb 27 '17 at 00:53
  • Thank you very much. I'll test both :-) – Jeremy Feb 27 '17 at 00:55
  • Well, you may also rely on *stringr*: `library(stringr)` -> `str_extract_all(s, "\\D+")`. Or base R: `regmatches(s, gregexpr("\\D+", s))`. – Wiktor Stribiżew Feb 27 '17 at 01:10
  • Could [this](http://stackoverflow.com/questions/20891104/r-how-to-avoid-strsplit-hiccuping-on-empty-vectors-when-splitting-text) help you? – Thomas Ayoub Feb 27 '17 at 13:09
  • @WiktorStribiżew's explanation really helped! Since I now understand `strsplit` I trust the `[-1]`. Thanks a lot! – Jeremy Feb 27 '17 at 15:24
  • I added an answer then. – Wiktor Stribiżew Feb 27 '17 at 16:10

1 Answer


The empty element appears because the string starts with digits. Since you split at digits, the first match sits between the start of the string and the first F, so the empty string in front of that match becomes the first element of the result.
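To make the rule concrete, here is a small sketch in base R. The two input strings are made up for illustration; the behavior shown (a leading match produces "", a trailing match does not) is what the strsplit documentation describes.

```r
# A match at the very start of the string produces a leading "" ...
leading <- strsplit("88F5T", "\\d+")[[1]]
print(leading)

# ... but a match at the very end does not produce a trailing "".
trailing <- strsplit("F5T7", "\\d+")[[1]]
print(trailing)
```

This asymmetry is why dropping only the first element with [-1] is enough for input that always starts with a number.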

You may use your own solution since it is already working well. If you are interested in alternative solutions, see below:

unlist(strsplit(sub("^\\d+", "", "88F5T7F4T13F"), "\\d+"))

It makes the empty element in the resulting split disappear, since sub with the ^\d+ pattern removes all leading digits (^ is the start of string and \d+ matches 1 or more digits). However, it is less elegant, since it needs 2 regexps.

library(stringr)
s <- "88F5T7F4T13F"
res = str_extract_all(s, "\\D+")  # a list with one character vector per input string

This requires only one regex, \D+ (1 or more non-digit characters), plus one external library.

If you want to do a similar thing with base R, use regmatches with gregexpr:

regmatches(s, gregexpr("\\D+", s))
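Since the goal in the question is to rebuild the logical vector, the same regmatches approach can extract the run lengths as well. The following is only a sketch, assuming the string always alternates a count with a single T or F; the names counts, flags and x are chosen here for illustration.

```r
s <- "88F5T7F4T13F"

# Run lengths: every run of digits, converted to integers.
counts <- as.integer(regmatches(s, gregexpr("\\d+", s))[[1]])

# Run values: every run of non-digits, mapped to TRUE/FALSE.
flags <- regmatches(s, gregexpr("\\D+", s))[[1]] == "T"

# Guard against malformed input before expanding.
stopifnot(length(counts) == length(flags))

# Expand the run-length encoding into the full logical vector.
x <- rep(flags, times = counts)
length(x)  # 117
sum(x)     # 9
```

Because both pieces are extracted by matching rather than splitting, no empty element appears and nothing has to be dropped afterwards.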
Wiktor Stribiżew