2

I found a few questions heading in this direction, but I could not apply the solutions to my specific problem: I have a quite a messy column of a dataframe with addresses. This means, there can be empty cells, numbers, numbers and text combined - and there can be one or more special characters in between.

In a first step, I want to split all values at the first special character. I tried various options that work partially. However, the problem seems to be that some cells don't contain any special characters - causing an error in the function.

For example, the following code puts only the special character in the new column b, but does not really split the columns:

df <- df %>% 
separate(address, into = c("a", "b"), sep = "[^[:punct:]]+", remove = FALSE)

So, what ideally I want to achieve is the following: If there is a special character in the cell, split it at the first special character, everything left of the first special character in column a, everything right in column b. If there is no special character, put the whole thing in column a and NA in column b.

Do I have to wrap my code in an ifelse-statement? Or are there any other suggestions?

Thanks!

Edit: as requested, some sample data:

library(dplyr)

test <- as.data.frame(c("2", "97/7", "17/7-8", "7E", "800E/7", "17", "", "0", "2/15", "17+18", "17/7/8", "19", "2/2/4", "9-7/8")) %>% 
  rename(address = 1)
jay.sf
  • 60,139
  • 8
  • 53
  • 110
Jakob
  • 163
  • 6
  • Usually the term "special character" refers to "has meaning in code", see [this Q&A](https://stackoverflow.com/q/27721008/6574038). You may want to give an example of your data frame to get better help. Please be sure to use `dput` or consider [this guide](https://stackoverflow.com/a/5963610/6574038). – jay.sf Dec 17 '20 at 09:49

3 Answers3

1

Using strsplit with your regular expression works. We may put it in an lapply to loop over the columns. Using `length<-()` we adjust the lengths of the list elements to their maximum to be able to create a data.frame.

r <- el(lapply(test, strsplit, "[[:punct:]]", perl=TRUE))
as.data.frame(t(sapply(r, `length<-`, max(lengths(r)))))
#      V1   V2   V3
# 1     2 <NA> <NA>
# 2    97    7 <NA>
# 3    17    7    8
# 4    7E <NA> <NA>
# 5  800E    7 <NA>
# 6    17 <NA> <NA>
# 7  <NA> <NA> <NA>
# 8     0 <NA> <NA>
# 9     2   15 <NA>
# 10   17   18 <NA>
# 11   17    7    8
# 12   19 <NA> <NA>
# 13    2    2    4
# 14    9    7    8

Similarly we can do it at the first occurrence: We may use sub to replace the first occurrence with something, say "£" and then split it there.

test[] <- lapply(test, sub, pat="[[:punct:]]", rep="£")
r <- el(lapply(test, strsplit, "£"))
as.data.frame(t(sapply(r, `length<-`, max(lengths(r)))))
#      V1   V2
# 1     2 <NA>
# 2    97    7
# 3    17  7-8
# 4    7E <NA>
# 5  800E    7
# 6    17 <NA>
# 7  <NA> <NA>
# 8     0 <NA>
# 9     2   15
# 10   17   18
# 11   17  7/8
# 12   19 <NA>
# 13    2  2/4
# 14    9  7/8
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Thanks - goes in the right direction, but it splits all special characters. In this first stept, I want to split only at the first special character into two columns (`column a` is usaully the hous number, `column b` a combination for the exact door). – Jakob Dec 17 '20 at 10:28
  • Thanks - this works! Interesting solution, many new R things I need to look into! :) – Jakob Dec 17 '20 at 10:48
1

You can use separate using extra = 'merge' and fill = 'right'

tidyr::separate(test, address, into = c("a", "b"), '[[:punct:]]', 
                extra = 'merge', fill = 'right', remove = FALSE)

#   address    a    b
#1        2    2 <NA>
#2     97/7   97    7
#3   17/7-8   17  7-8
#4       7E   7E <NA>
#5   800E/7 800E    7
#6       17   17 <NA>
#7               <NA>
#8        0    0 <NA>
#9     2/15    2   15
#10   17+18   17   18
#11  17/7/8   17  7/8
#12      19   19 <NA>
#13   2/2/4    2  2/4
#14   9-7/8    9  7/8
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Perfect! I could not figure out what the `extra`-argument was supposed to do using the help provided - but it seems to be exactly what I was looking for. – Jakob Dec 17 '20 at 10:49
0

Does this work:

library(dplyr)
library(tidyr)
df %>% separate(c1, c('a','b'), sep = '[^A-z0-9_]')
     a    b
1   ab   cd
2   pq   rj
3   xy    z
4 abcd <NA>

Data used:

df
     c1
1 ab$cd
2 pq%rj
3  xy#z
4  abcd
Karthik S
  • 11,348
  • 2
  • 11
  • 25
  • Works for most columns, but not for those that have more than one special character. For these cells, everything from the second special character onwards is cut off. Could you also explain why the split works if you use in letters `A-z` and digits `0-9_`? Thanks! – Jakob Dec 17 '20 at 09:54