
I have a tibble whose head() prints like this:

# A tibble: 6 × 1
                                   id.make.model.year
                                             <chr>
1  27550?????AM General?????DJ Po Vehicle 2WD?????1984
2  28426?????AM General?????DJ Po Vehicle 2WD?????1984
3   27549?????AM General?????FJ8c Post Office?????1984
4   28425?????AM General?????FJ8c Post Office?????1984
5 1032?????AM General?????Post Office DJ5 2WD?????1985
6 1033?????AM General?????Post Office DJ8 2WD?????1985

It has only one column. I want to separate it into four columns named id, make, model and year. I tried to use separate():

A %>% 
  separate(id.make.model.year,into=c("id","make"),sep="?????")

and

A %>% 
  separate(id.make.model.year,into=c("id","make"),sep="\\?????")

but they both return the following error:

Error in stringi::stri_split_regex(value, sep, n_max) : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

I also tried a character class:

A %>% 
  separate(id.make.model.year,into=c("id","make"),sep="[?????]")

which returns

# A tibble: 33,439 × 2
      id  make
*  <chr> <chr>
1  27550      
2  28426      
3  27549      
4  28425      
5   1032      
6   1033      
7   3347      
8  13309      
9  13310      
10 13311      
# ... with 33,429 more rows
Warning message:
Too many values at 33439 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ... 

I also tried dropping sep, but all the spaces are clearly counted as separators.

What's the right way to do this? Thanks in advance.

Konamiman
godric97
  • Please add the output of `dput(head(df))` in your question, to make this a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Anyway it just sounds like a simple regex question to me. – smci Nov 05 '16 at 09:06
  • The regex to match one question mark is `\?`, or `[?]`. However, if you have five of them, `[?????]` still only matches one occurrence of that character, just like `[aaaaa]` would only match one letter `a`, not five. So I think you want `\?{5}` or `[?]{5}` (or `\?\?\?\?\?` or `[?][?][?][?][?]`). Until you post data with `dput()` I can't confirm. – smci Nov 05 '16 at 09:30
  • By the way, if the '?????' came from `read.csv()` with a wrong Unicode encoding, or weird separator char(s), you might want to fix that. – smci Nov 05 '16 at 12:19

2 Answers


The regex to match one question mark is `\?`, or `[?]`. However, if you have five of them, `[?????]` still only matches one occurrence of that character, because `[...]` defines a character class. In the same way, `[aaaaa]` would only match one letter `a`, not five.
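
To see the difference, here is a minimal sketch using base R's `strsplit()` on one of the strings from the question. It uses base R's regex engine rather than the stringi engine that `separate()` reported in the error, but character classes and `{5}` behave the same way here. The variable names are only for illustration.

```r
x <- "27550?????AM General"

# A character class matches ONE "?" at a time, so the split happens at
# every single question mark and leaves empty strings in between.
one_at_a_time <- strsplit(x, "[?????]")[[1]]

# Quantifying the class with {5} consumes all five question marks as
# a single separator.
five_at_once <- strsplit(x, "[?]{5}")[[1]]

print(one_at_a_time)
print(five_at_once)
```

The extra empty pieces in the first result are what `separate()` complains about with "Too many values" when `into` has only two names.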

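So to match the five repetitions I think you want `\?{5}` or `[?]{5}` (or `\?\?\?\?\?` or `[?][?][?][?][?]`). In an R string literal the backslash has to be doubled, so the first form is written `"\\?{5}"`.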

Until you post data with dput() I can't confirm.
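
Applied to `separate()`, that could look like the sketch below. The data frame `A` is rebuilt here from two rows of the printed `head()` in the question, so it is an assumption about the real data, and `convert = TRUE` is an optional extra that turns `id` and `year` into integers.

```r
library(tidyr)

# Stand-in for the asker's tibble, rebuilt from the printed rows.
A <- data.frame(
  id.make.model.year = c(
    "27550?????AM General?????DJ Po Vehicle 2WD?????1984",
    "1032?????AM General?????Post Office DJ5 2WD?????1985"
  ),
  stringsAsFactors = FALSE
)

# Four target names, and a separator that matches exactly five "?".
out <- separate(
  A,
  id.make.model.year,
  into = c("id", "make", "model", "year"),
  sep = "\\?{5}",
  convert = TRUE
)

print(out)
```

Note that `into` needs all four names; with only `c("id", "make")` the remaining pieces are dropped with a warning.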

smci

Here is one solution using the splitstackshape and data.table packages. Split the column with cSplit(); since you want four columns, specify direction = "wide". Once the four columns exist, rename them. I built the four names by splitting the original column name with strsplit().

library(splitstackshape)
library(data.table)

mydf <- data.frame(id.make.model.year = c("27550?????AM General?????DJ Po Vehicle 2WD?????1984",
                                          "28426?????AM General?????DJ Po Vehicle 2WD?????1984"),
                   stringsAsFactors = FALSE)

temp <- cSplit(mydf, splitCols = "id.make.model.year", sep = "?????", direction = "wide")
setnames(temp, unlist(strsplit(names(mydf), "[.]")))


#      id       make             model year
#1: 27550 AM General DJ Po Vehicle 2WD 1984
#2: 28426 AM General DJ Po Vehicle 2WD 1984
jazzurro