Seperation on letternumber value inside a column based on a letter

Question

I have a problem with separating inside a column

The data inside that column is a code for a device location looks like this SE005 or H0002 or MANA. the S is a device that is mobile and the letter after that states where it is used.

SE005 is the fifth a mobile device place E.

H0002 is immobile device number 2 at place H.

MANA is a device at one place

for my analysis in power BI I need to no how many articles where scanned on a device at a place I don’t care about which device exactly. Because Power BI can't summarise per placement of the device (because its a combined value inside a column) I want to split it up.

I want it to look like this.

               v1    v2     v3
SE005 becomes  S     E      005  # 2 separations

H0002 becomes        H      005  #1 separation and one deleted number

MANA                MANA           #R should not change this but is should be inside the same column as E and H

I have to apply this to 8 million rows. And I think it must be done in two or 3 steps first separate the letters from the numbers. And note that there are more letters than the one in the preview. But the arrangement is the same. Any help is appreciated.

EDIT

just want to split op the device column so power bi can work with it.

art <- c(1:100)
device <-c("SE05", "H005", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005", 
       "SE05", "H005", "E003", "MANA", "J012", "X01", "W007", "MANA", "SE02", "H005", 
       "SE05", "H007", "E003", "MANA", "J012", "X02", "W007", "MANA", "SE02", "H005",
       "SE05", "H008", "E004", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005", 
      "SE05", "H005", "E003", "MANA", "J012", "X017", "W007", "MANA", "SE02", "H005",
       "SE05", "H0010", "E008", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
       "SE05", "H005", "E003", "MANA", "J012", "X009", "W007", "MANA", "SE02", "H005",
       "SE05", "H0010", "E0010", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
       "SE05", "H005", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
       "SE05", "H009", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005")

ACCEPT <- as.data.frame(art)
ACCEPT$device <- device     




  head(BLABLA)
  Article device   V3       V4     
1 52290   SE05  20170223  162756  
2 52300   SE05  20170223  162758  
3 10090   SE05  20170223  162831  
4 10060   SE08  20170223  162834  
5 10070   SE08  20170223  162839  
6 10070   SE08  20170223  162859

Relevant: http://stackoverflow.com/questions/3003527/how-do-i-specify-a-dynamic-position-for-the-start-of-substring — , Mar 23 '17 at 12:24
Thanks but i've seen some of these already. But the use a sertain point to spilt up at a dot or a dash. and if i want to use this I need to type all letters that need to be separated right? — danielle, Mar 23 '17 at 12:46
What you need is `sub` and regular expressions. It's there in the first answer. You want to substitute the numbers into v3, letters before numbers into v1 and v2, and letters without numbers into v2 only. — , Mar 23 '17 at 12:49
the first awnser user \\ and dots andf other stuff if i fill that with letters it doesn't work. So I dont understand what you mean — danielle, Mar 23 '17 at 13:03
Have you considered looking at the help for `sub` and for `regexp`? — , Mar 23 '17 at 13:04
Thanks for youre sugestion but I can't seem to get it working. — danielle, Mar 23 '17 at 13:14
OK. So if you want help, could you post a [mcve]? See also http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — , Mar 23 '17 at 13:21
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/138832/discussion-between-danielle-and-dash2). — danielle, Mar 23 '17 at 14:15

JanLauGe · Accepted Answer · 2017-03-24T13:25:17.720

Try this site to get a better understanding of regexp and how you can apply it in your case. Without a reproducible example it is difficult to understand your specific situation and the edge cases you might come across. Hopefully my example below will help to get you started:

EDIT: Changed my answer to use your example dataset

art <- c(1:100)
device <-c("SE05", "H005", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005", 
           "SE05", "H005", "E003", "MANA", "J012", "X01", "W007", "MANA", "SE02", "H005", 
           "SE05", "H007", "E003", "MANA", "J012", "X02", "W007", "MANA", "SE02", "H005",
           "SE05", "H008", "E004", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005", 
           "SE05", "H005", "E003", "MANA", "J012", "X017", "W007", "MANA", "SE02", "H005",
           "SE05", "H0010", "E008", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
           "SE05", "H005", "E003", "MANA", "J012", "X009", "W007", "MANA", "SE02", "H005",
           "SE05", "H0010", "E0010", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
           "SE05", "H005", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005",
           "SE05", "H009", "E003", "MANA", "J012", "X021", "W007", "MANA", "SE02", "H005")

ACCEPT <- as.data.frame(art)
ACCEPT$device <- device 

library(tidyverse)
library(magrittr)
library(stringr)

# Find mobile devices
# '^' for start of string
# '[\\D]' for any non-numeric
# '{2}' for exactly two of them
ACCEPT %<>% mutate(mobile = str_detect(device, pattern = '^[\\D]{2}[\\d]{1}'))
# Now looking for exactly one letter at the start,
# followed by a number
ACCEPT %<>% mutate(immobile = str_detect(device, pattern = '^[\\D]{1}[\\d]{1}'))
# Finally, look for "no numbers"
# (alternatively, if all places have the same value, '== "MANA"' would do)
ACCEPT %<>% mutate(place = !str_detect(device, pattern = '\\d'))

# Split and process device types individually
bind_rows(ACCEPT %>%
    filter(mobile) %>%
    mutate(v1 = str_extract(device, pattern = '[^\\d]{1}'),
         v2 = str_sub(device, start = 2, end = 2),
         v3 = str_extract(device, pattern = '\\d{1,9}')),
  ACCEPT %>%
    filter(immobile) %>%
    mutate(v1 = '',
           v2 = str_sub(device, start = 1, end = 1),
           v3 = str_extract(device, pattern = '\\d{1,9}')),
  ACCEPT %>%
    filter(place) %>%
    mutate(v1 = '',
           v2 = device,
           v3 = '')) %>%
  arrange(art) %>%
  select(art, v1, v2, v3)

I don't know why but the end result has less records with this code. — danielle, Mar 23 '17 at 15:29
Oups, you are right! I had a bug in `ACCEPT %<>% mutate(mobile = str_detect(device, pattern = '^[\\D]{2}'))` which should instead be `ACCEPT %<>% mutate(mobile = str_detect(device, pattern = '^[\\D]{2}[\\d]{1}'))`. Answer updated accordingly — JanLauGe, Mar 23 '17 at 15:46
I'm not losing code anymore that's perfect. but I'm not looking to assing them mobile inmobile and place but if thats needed to get what i want that's fine. I'm still looking for the separation of the total code SE02 in S, E and 02. what your first code did but I lost records with that. — danielle, Mar 24 '17 at 11:15
I think pre-computing the device types by which to split the processing will scale better to your larger dataset. You can remove the superfluous columns by adding `select(art, v1, v2, v3)` at the end. I will add that to the code in my answer now. Regarding splitting SE02 in S, E and 02: This works for me from the example above. What exactly is not the way you want it? — JanLauGe, Mar 24 '17 at 13:29
Does this now do what yo need? If so, please accept it as the answer to your question :) — JanLauGe, Mar 27 '17 at 09:02

score 0 · Answer 2 · answered Mar 23 '17 at 15:07

0

Here is a slightly shorter version which doesn't use dplyr.

# v2 gets all of 'device' so long as this is entirely alphabetical:
ACCEPT$v2 <- ifelse(grepl('^[A-Z]+$', ACCEPT$device), ACCEPT$device, NA) 

# v3 gets the number, if there is one - we check by seeing if v2 is NA
ACCEPT$v3 <- ifelse(is.na(ACCEPT$v2), sub('\\D+(\\d+)', '\\1', ACCEPT$device), NA)

# now v1 and v2 will get the first two letters, 
# but only if v2 hasn't already been filled out:
ACCEPT$v1[is.na(ACCEPT$v2)] <- substr(ACCEPT$device[is.na(ACCEPT$v2)], 1, 1)
ACCEPT$v2[is.na(ACCEPT$v2)] <- substr(ACCEPT$device[is.na(ACCEPT$v2)], 2, 2)

answered Mar 23 '17 at 15:07

No data gets lost but it goes wrong because only "S" is mobile en should be in a seperate column instead of "E". with youre code E is separate en S is with al other immobile places. – danielle Mar 23 '17 at 15:22
ah, so only S should be in V1? And also H0002 becomes H005? Where does that 005 come from? Or is there a typo in your question and you meant it becomes H002? – Mar 23 '17 at 16:51
Yes S should be V1 (states that its a device that is mobile, and is used at place x which is in the second letter of a code with a S) , al other letters (places) V2 and the number V3. Yess it was a typo in my question. H0002 should be H and 002. – danielle Mar 24 '17 at 11:19

Seperation on letternumber value inside a column based on a letter

2 Answers2