1

I am trying to break up a row of data. Unfortunately, all my runs are saved as one long row.

The first value is the ID number and the last is the gender. (The middle two values are not needed.)

[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]

I would like to learn how to break the row, so each value in a bracket is separated into its own cell, and not just as one long row of data in brackets.

My strategy is to get R to split at "]" or at the space between "] [".

It seems like a very simple problem, but my attempts with string splitting, substitution, and other functions aren't working.

Please help? I'm just a little tilted/frustrated!

Thanks so much

usa_josh
  • Why are your arguments not working? What happens when you try? Show the code you're trying, and the errors you get. – pak Mar 21 '17 at 02:18
  • wow I am really dumb, I just had to change the data frame to a character string and all my codes worked. Thanks for your help anyways fam! – usa_josh Mar 21 '17 at 04:04

3 Answers

2

You can do it all in one go using strsplit and some reshaping:

matrix(strsplit(txt, '[][ "]+')[[1]][-1], ncol=4, byrow=TRUE)
#     [,1]  [,2] [,3] [,4]    
#[1,] "131" "22" "2"  "male"  
#[2,] "123" "23" "2"  "female"
#[3,] "232" "21" "2"  "male"  
#[4,] "132" "21" "2"  "male" 

Or via read.table after cleaning out the brackets:

read.table(text=gsub("^\\[\\[|\\] \\[|\\]\\]$", "\n", txt))
#   V1 V2 V3     V4
#1 131 22  2   male
#2 123 23  2 female
#3 232 21  2   male
#4 132 21  2   male

Where txt was:

txt <- '[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]' 
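Since the question only needs the ID and the gender, and the comments suggest the value may be stored as a factor in a data frame, here is a minimal sketch that combines both points. The data frame `runs` and its column `raw` are hypothetical names used only for illustration; the splitting pattern is the one from the first approach above.

```r
# Hypothetical input: the whole row stored as a factor column in a data frame
runs <- data.frame(
  raw = '[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]',
  stringsAsFactors = TRUE
)

# strsplit() needs a character vector, so convert the factor first
txt_chr <- as.character(runs$raw)

# Split on runs of brackets, spaces and quotes; drop the leading empty string
fields <- strsplit(txt_chr, '[][ "]+')[[1]][-1]
m <- matrix(fields, ncol = 4, byrow = TRUE)

# Keep only the first (ID) and last (gender) columns
result <- data.frame(
  id     = as.integer(m[, 1]),
  gender = m[, 4],
  stringsAsFactors = FALSE
)
result
```

This assumes every bracketed group holds exactly four values; if a group could be shorter or longer, the `ncol = 4` reshaping would misalign the columns.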
thelatemail
  • Another pattern: `"[][ ]{2,3}"` – Frank Mar 21 '17 at 03:05
  • hmm says non character argument... any ideas? – usa_josh Mar 21 '17 at 03:09
  • @usa_josh - you probably have a factor not a character vector. Just use `as.character(txt)` when splitting – thelatemail Mar 21 '17 at 03:12
  • Which if you'd Googled - "error in strsplit non-character argument" the first result would have told you this: http://stackoverflow.com/questions/15430016/non-character-argument-in-r-string-split-function-strsplit – thelatemail Mar 21 '17 at 03:16
0

When I try strsplit, it complains about the special character [ in the string. The correct escape sequence is two backslashes:

s = '[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]' 
v0 = c(strsplit(s, "] \\["))

You'll end up with a list holding a character vector, in which the first string has [[ at the start and the last string has ]] at the end. Clean these up separately:

v1 = lapply(v0, function(s) gsub("\\[", "", s))
v2 = lapply(v1, function(s) gsub("]", "", s))
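To get from the cleaned strings to separate cells, one possible follow-up is to hand them to read.table, which treats each string as one space-separated record. This is only a sketch: it restates the split and clean-up steps so it runs on its own, and the column names `id`, `v2`, `v3` and `gender` are made up for illustration.

```r
s <- '[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]'

# Split between groups, then strip any remaining brackets
groups <- strsplit(s, "] \\[")[[1]]
groups <- gsub("\\[|\\]", "", groups)

# Each element is now one space-separated record; read.table also
# removes the double quotes around the gender values
records <- read.table(text = groups,
                      col.names = c("id", "v2", "v3", "gender"),
                      stringsAsFactors = FALSE)

# Keep only the ID and gender columns
records[, c("id", "gender")]
```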

Hope this helps!

lebelinoz
  • hmm says there is an error... non character argument! any ideas? – usa_josh Mar 21 '17 at 03:07
0

You can use a combination of dplyr, tidyr and stringr to get the desired result. separate_rows from tidyr breaks one row into multiple rows, and separate, also from tidyr, splits one column into several new columns. Because the string ends with two closing brackets with nothing after them, we get two rows with NA values and a warning, so we remove those rows using na.omit(). If you want only the first and last columns, you can use select from dplyr.

library(dplyr)    # for %>% and select
library(tidyr)    # for separate_rows and separate
library(stringr)  # for str_trim, i.e. removing leading and trailing white space
# data frame from your data 
df_1 <- data.frame(col1='[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]' , stringsAsFactors = FALSE)

# separate rows on closing brackets
df_2 <- df_1 %>%   separate_rows(col1, sep = "]")

# remove the remaining brackets and leading and trailing white space
df_2["col1"] <- gsub("\\[|\\]", "", str_trim(df_2[["col1"]], "both") )

# separate the single column data to multiple columns
df_2 %>% separate(col = col1, into = c("ID", "Num1","Num2", "Gender"), sep = " ") %>% na.omit() %>% select(1,4)

The output will be

 A tibble: 4 × 2
     ID   Gender
  <chr>    <chr>
1   131   "male"
2   123 "female"
3   232   "male"
4   132   "male"
discipulus