Breaking a row of data, by brackets

Question

I am trying to break up a row of data. Unfortunately, all my runs are saved as one long row.

The first value is the ID number and the last is the gender. (The middle two are not needed.)

[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]

I would like to learn how to break the row, so each value in a bracket is separated into its own cell, and not just as one long row of data in brackets.

My strategy is to get R to split at "]" or at the space between "] ["

It seems like a very simple problem, but my stringsplit, substitute, and other arguments aren't working.

Please help? I'm just a little tilted/frustrated!

Thanks so much

Why are your arguments not working? What happens when you try? Show the code you're trying, and the errors you get. — pak, Mar 21 '17 at 02:18
wow I am really dumb, I just had to change the data frame to a character string and all my codes worked. Thanks for your help anyways fam! — usa_josh, Mar 21 '17 at 04:04

score 2 · Answer 1 · answered Mar 21 '17 at 03:01

You can do it all in one go using strsplit and some reshaping:

matrix(strsplit(txt, '[][ "]+')[[1]][-1], ncol=4, byrow=TRUE)
#     [,1]  [,2] [,3] [,4]    
#[1,] "131" "22" "2"  "male"  
#[2,] "123" "23" "2"  "female"
#[3,] "232" "21" "2"  "male"  
#[4,] "132" "21" "2"  "male"

Or via read.table after cleaning out the brackets:

read.table(text=gsub("^\\[\\[|\\] \\[|\\]\\]$", "\n", txt))
#   V1 V2 V3     V4
#1 131 22  2   male
#2 123 23  2 female
#3 232 21  2   male
#4 132 21  2   male

Where txt was:

txt <- '[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]'
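
Since the question only needs the ID and the gender, the matrix from the first approach can be trimmed to its first and last columns. This is a minimal sketch in base R; the names `parts`, `m`, `result`, `id` and `gender` are illustrative and not part of the original answer.

```r
# Same sample string as in the answer above
txt <- '[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]'

# Split on runs of brackets, spaces and double quotes; drop the empty first piece
parts <- strsplit(txt, '[][ "]+')[[1]][-1]
m <- matrix(parts, ncol = 4, byrow = TRUE)

# Keep column 1 (ID) and column 4 (gender); the column names are hypothetical
result <- data.frame(id = as.integer(m[, 1]),
                     gender = m[, 4],
                     stringsAsFactors = FALSE)
print(result)
```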

thelatemail

Another pattern: `"[][ ]{2,3}"` – Frank Mar 21 '17 at 03:05
hmm says non character argument... any ideas? – usa_josh Mar 21 '17 at 03:09
@usa_josh - you probably have a factor not a character vector. Just use `as.character(txt)` when splitting – thelatemail Mar 21 '17 at 03:12
Which if you'd Googled - "error in strsplit non-character argument" the first result would have told you this: http://stackoverflow.com/questions/15430016/non-character-argument-in-r-string-split-function-strsplit – thelatemail Mar 21 '17 at 03:16
wow I am really dumb, I just had to change the data frame to a character string and all my codes worked. Thanks for your help anyways fam! – usa_josh Mar 21 '17 at 04:04
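
The "non-character argument" error from the comments above can be reproduced when the string is stored as a factor, which is assumed here to be what happened in the asker's data frame. The sketch below shows the failure and the `as.character()` fix; `txt_factor`, `err` and `pieces` are illustrative names.

```r
# Assumption: the data arrived as a factor rather than a character vector
txt_factor <- factor('[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]')

# strsplit refuses non-character input; capture the error message instead of stopping
err <- tryCatch(strsplit(txt_factor, '[][ "]+'),
                error = function(e) conditionMessage(e))
print(err)

# Converting to character first makes the same split work
pieces <- strsplit(as.character(txt_factor), '[][ "]+')[[1]][-1]
print(matrix(pieces, ncol = 4, byrow = TRUE))
```

If the string sits in a data frame, pulling out the column first (for example with `df[[1]]`) and then calling `as.character()` on it should give the same result.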

score 0 · Answer 2 · answered Mar 21 '17 at 02:32

When I try strsplit, it complains about the special character [ in the string. The correct escape sequence is two backslashes:

s = '[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]' 
v0 = c(strsplit(s, "] \\["))

You'll end up with a list holding one character vector, in which the first string has [[ at the start and the last string has ]] at the end. Clean these up separately:

v1 = lapply(v0, function(s) gsub("\\[", "", s))
v2 = lapply(v1, function(s) gsub("]", "", s))
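
The steps above stop at one cleaned string per bracket group. A possible way to finish the job, sketched in base R under the assumption that every group holds exactly four space-separated fields, is to split each string again and bind the pieces into a matrix. The names `rows`, `fields` and `m` are illustrative.

```r
s <- '[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]'

# One string per bracket group (both brackets escaped here for safety)
rows <- strsplit(s, "\\] \\[")[[1]]

# Remove the leftover brackets and the double quotes around the gender
rows <- gsub("\\[|\\]", "", rows)
rows <- gsub('"', "", rows)

# Split each group on spaces and stack the groups as matrix rows
fields <- strsplit(rows, " ")
m <- do.call(rbind, fields)
print(m)
```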

Hope this helps!

lebelinoz

hmm says there is an error... non character argument! any ideas? – usa_josh Mar 21 '17 at 03:07
wow I am really dumb, I just had to change the data frame to a character string and all my codes worked. Thanks for your help anyways fam! – usa_josh Mar 21 '17 at 04:04

discipulus · Answer 3 · 2017-03-21T03:04:57.350

You can use a combination of tidyr, dplyr and stringr to get the desired result. separate_rows from tidyr breaks one row into multiple rows and, similarly, separate (also from tidyr) splits one column into several new columns. Because the string ends with two closing brackets with nothing after them, we get two empty rows that turn into NA values with a warning, so we remove them using na.omit(). If you want only the first and last columns, you can use select from dplyr.

library(dplyr)   # for select and the %>% pipe
library(tidyr)   # for separate_rows and separate
library(stringr) # for string manipulation, i.e. removing trailing and leading white space
# data frame from your data 
df_1 <- data.frame(col1='[[131 22 2 "male"] [123 23 2 "female"] [232 21 2 "male"] [132 21 2 "male"]]' , stringsAsFactors = FALSE)

# separate rows on closing brackets
df_2 <- df_1 %>%   separate_rows(col1, sep = "]")

# remove the remaining brackets and the leading and trailing white space
df_2["col1"] <- gsub("\\[|\\]", "", str_trim(df_2[["col1"]], "both") )

# separate the single column data to multiple columns
df_2 %>% separate(col = col1, into = c("ID", "Num1","Num2", "Gender"), sep = " ") %>% na.omit() %>% select(1,4)

The output will be

# A tibble: 4 × 2
     ID   Gender
  <chr>    <chr>
1   131   "male"
2   123 "female"
3   232   "male"
4   132   "male"

wow thank you! How could i go about doing this if I had say 500 rows? — usa_josh, Mar 21 '17 at 03:05
Number of rows does not matter, it should scale to any number of rows. — discipulus, Mar 21 '17 at 03:07
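
To illustrate the scaling point, here is a base R sketch that assumes "500 rows" means many strings of the same bracketed shape, one per run. The helper `parse_bracket_row` and the vector `runs` are hypothetical and not taken from the answer above.

```r
# Hypothetical helper: turn one bracketed string into a data frame of ID and gender
parse_bracket_row <- function(x) {
  parts <- strsplit(as.character(x), '[][ "]+')[[1]]
  parts <- parts[parts != ""]                    # drop the empty leading piece
  m <- matrix(parts, ncol = 4, byrow = TRUE)     # assumes four fields per group
  data.frame(id = as.integer(m[, 1]),
             gender = m[, 4],
             stringsAsFactors = FALSE)
}

# Two sample runs standing in for a longer vector or data frame column
runs <- c('[[131 22 2 "male"] [123 23 2 "female"]]',
          '[[232 21 2 "male"] [132 21 2 "male"]]')

# Parse every run and stack the results into a single data frame
all_rows <- do.call(rbind, lapply(runs, parse_bracket_row))
print(all_rows)
```

The same call works unchanged for a vector of 500 strings, since `lapply` just visits each element in turn.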
