I have a single field that holds semantic tag types and tags. Each type/tag pair is separated from the next by a comma, and within a pair the tag type and the tag are separated by a colon (see below).

ID | Semantic Tags

1  |   Person:mitch mcconnell, Person:ashley judd, Position:senator

2  |   Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 

3  |   Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 

4  |   Person:ashley judd, topicname:politics

5  |   URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc

I want to split each tag type (term before colon) & tag (term after colon) into two separate fields: "Tag Type" & "Tag". The resulting file should look something like this:

ID | Tag Type  |  Tag

1  |  Person   |  mitch McConnell

1  |  Person   |  ashley judd  

1  |  Position |  senator

2  |  Person   |  mitch McConnell

2  |  Position |  senator

2  |  State    |  kentucky

Here is the code I have so far...

tag<-strsplit(as.character(emtable$Symantic.Tags),",")
tagtype<-strsplit(as.character(tag),":")

But after that, I'm lost! I believe I need to use lapply or sapply for this, but am not sure where that plays in...

My apologies if this has been answered in other forms on the site -- I am new to R & this is still a bit complex for me.

Thanks in advance for anyone's help.

CHP
NiuBiBang

2 Answers

4

This is another (slightly different) approach:

## dat <- readLines(n=5)
## Person:mitch mcconnell, Person:ashley judd, Position:senator
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
## Person:ashley judd, topicname:politics
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info

dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
#dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)]) #optional: keep only some tag types
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) #break on : not followed by /
dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)),
    do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))
)

colnames(dat3)[-1] <- c("Tag Type", "Tag")

##    ID        Tag Type                    Tag
## 1   1          Person        mitch mcconnell
## 2   1          Person            ashley judd
## 3   1        Position                senator
## 4   2          Person        mitch mcconnell
## 5   2        Position                senator
## 6   2 ProvinceOrState               kentucky
## 7   2       topicname               politics
## 8   3          Person        mitch mcconnell
## 9   3          Person            ashley judd
## 10  3    Organization                 senate
## 11  3    Organization             republican
## 12  4          Person            ashley judd
## 13  4       topicname               politics
## 14  5             URL www.huffingtonpost.com
## 15  5         Company              usa today
## 16  5          Person             chuck todd
## 17  5         Company                  msnbc
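
The `:(?!/)` pattern used above matters once a tag value contains a URL scheme such as `http:`. Here is a small sketch of the difference on a single string; the names `x`, `naive` and `safe` are made up for illustration:

```r
# One tag whose value contains "http:" (taken from the sample input above)
x <- "URL:http://www.regular-expressions.info"

# Splitting on every colon yields three pieces, so a later rbind()
# would see vectors of unequal length
naive <- strsplit(x, ":")[[1]]

# Splitting only on a colon NOT followed by "/" keeps the URL in one piece
safe <- strsplit(x, ":(?!/)", perl = TRUE)[[1]]

length(naive)  # 3
length(safe)   # 2
```

This only guards against the `://` case. A value with a colon elsewhere, such as a port number, would still split into more than two pieces.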

Thorough explanation:

## dat <- readLines(n=5)
## Person:mitch mcconnell, Person:ashley judd, Position:senator
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
## Person:ashley judd, topicname:politics
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info

dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
#dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)]) #optional: keep only some tag types
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) #break on : not followed by /

# Let the explanation begin...

# Here I have a short list of the variables from the rows
# of the original dataframe; in this case the row numbers:

seq_along(dat3)      #row variables

# then I use sapply and length to figure out how long the
# split variables in each row (now a list) are

sapply(dat3, length) #n times

# this tells me how many times to repeat the short list of 
# variables.  This is because I stretch the dat3 list to a vector
# Here I rep the row variables n times

rep(seq_along(dat3), sapply(dat3, length))

# better assign that for later:

ID <- rep(seq_along(dat3), sapply(dat3, length))

#============================================
# Now to explain the next chunk...
# I take dat3

dat3

# Each element in the list 1-5 is made of a new list of 
# vectors of length 2 of Tag_Types and Tags.
# For instance, here's element 5: a list of two 
# character vectors of length 2 

## [[5]]
## [[5]][[1]]
## [1] "URL"  "www.huffingtonpost.com"
## 
## [[5]][[2]]
## [1] "URL"  "http://www.regular-expressions.info"

# Use str to look at this structure:

dat3[[5]]
str(dat3[[5]])

## List of 2
##  $ : chr [1:2] "URL" "www.huffingtonpost.com"
##  $ : chr [1:2] "URL" "http://www.regular-expressions.info"

# I use lapply (list apply) to apply an anonymous function:
# function(x) do.call(rbind, x) 
#
# to each of the 5 elements.  This basically glues the list 
# of vectors together to make a matrix.  Observe just on element 5:

do.call(rbind, dat3[[5]])

##      [,1]  [,2]                                 
## [1,] "URL" "www.huffingtonpost.com"             
## [2,] "URL" "http://www.regular-expressions.info"

# We use lapply to do that to all elements:

lapply(dat3, function(x) do.call(rbind, x))

# We then use do.call(rbind, ...) on this list of matrices
# and we have a single matrix

do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))

# Let's assign that for later:

the_mat <- do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))

#============================================    
# Now we put it all together with data.frame:

data.frame(ID, the_mat)
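
The same `rep` pattern can carry any other per-row column down to the long format: repeat the column itself, once per tag in its row. Below is a minimal sketch, assuming a hypothetical data frame `dat_df` with an invented `Date` column; neither the name nor the values come from the original data.

```r
# Hypothetical input: the Date column and its values are invented for illustration
dat_df <- data.frame(
  ID   = 1:2,
  Date = c("1/2/2012", "1/5/2012"),
  Tags = c("Person:ashley judd, Position:senator", "topicname:politics"),
  stringsAsFactors = FALSE
)

# Same steps as above: split on commas, trim whitespace, split on colons
parts <- lapply(strsplit(dat_df$Tags, ","), function(x) gsub("^\\s+|\\s+$", "", x))
parts <- lapply(parts, strsplit, ":(?!/)", perl = TRUE)

# Number of tags in each original row
n <- sapply(parts, length)

# rep(x, times) with a vector 'times' repeats x[i] exactly times[i] times
long <- data.frame(
  ID   = rep(dat_df$ID, n),
  Date = rep(dat_df$Date, n),
  do.call(rbind, lapply(parts, function(x) do.call(rbind, x))),
  stringsAsFactors = FALSE
)
colnames(long)[3:4] <- c("Tag Type", "Tag")
long
```

Note that `rep` takes only the vector to repeat and the counts. The row index from `seq_along` is just the particular vector that was being repeated for `ID`.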
Tyler Rinker
  • This seems to be doing the trick. However, when I run the third command: `dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) )` I get the following message: Error in function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2) In addition: There were 50 or more warnings (use warnings() to see the first 50) – NiuBiBang Apr 10 '13 at 18:23
  • This issue is specific to your data and it doesn't look like the data you've shown here. You can use debugging tools like `debug` to figure out the first issue and for the second I'd do as it says and use `warnings()` to see more specifically why you get the warnings you do. – Tyler Rinker Apr 10 '13 at 18:58
  • yes, I saw that one of my tag types was URL, which frequently contained "http:" -- that ended up breaking the matrix into a non-uniform number of columns when splitting on ":". So I just added a line of code to remove the "http:", b/n the 1st & 2nd strsplit codes. – NiuBiBang Apr 14 '13 at 01:36
  • @Niu you figured it out but there's a regex that could have helped. See my edit and [Josh's answer](http://stackoverflow.com/a/15816365/1000343) that this change is based on. – Tyler Rinker Apr 14 '13 at 02:41
  • sorry to re-open a closed case, but could you tell me how I edit the following line of code, `dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) )`, to include other variables that should be repeated down the sequence; such as date, source of post, etc. e.g. if ID 1 was published on 1/2/2012, I would want to see a Date field with 1/2/2012 for all of ID 1's records. I understand the technicality behind the line of code itself, but not the principle as to apply it elsewhere. – NiuBiBang Apr 17 '13 at 19:53
  • Tyler, the explanation is much appreciated. I (mostly) understand what is going on now. But how does this line of code `ID <- rep(seq_along(dat3), sapply(dat3, length))` "know" to call & repeat ID variable? What if I had a Date variable, ID variable, etc? I tried to replicate your code with a Date variable `Date <- rep(dat$Date, seq_along(dat3), sapply(dat3, length))`, & received: _Warning message: In rep(et$Date, seq_along(et3), sapply(et3, length)) : first element used of 'length.out' argument_ & produce a character string of two elements. – NiuBiBang Apr 19 '13 at 20:40
  • Try: `rep(dat$Date, sapply(dat3, length))` If this doesn't do it please open a new question with the specific data you're talking about. – Tyler Rinker Apr 19 '13 at 20:58
3
DF
##   ID                                                                                  Semantic.Tags
## 1  1                                   Person:mitch mcconnell, Person:ashley judd, Position:senator
## 2  2        Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## 3  3      Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## 4  4                                                         Person:ashley judd, topicname:politics
## 5  5                URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc


# as.character() guards against the column having been read in as a factor
ll <- lapply(strsplit(as.character(DF$Semantic.Tags), ","), strsplit, split = ":")

f <- function(x) do.call(rbind, x)

f(lapply(ll, f))
##       [,1]               [,2]                    
##  [1,] "     Person"      "mitch mcconnell"       
##  [2,] " Person"          "ashley judd"           
##  [3,] " Position"        "senator"               
##  [4,] "     Person"      "mitch mcconnell"       
##  [5,] " Position"        "senator"               
##  [6,] " ProvinceOrState" "kentucky"              
##  [7,] " topicname"       "politics "             
##  [8,] "     Person"      "mitch mcconnell"       
##  [9,] " Person"          "ashley judd"           
## [10,] " Organization"    "senate"                
## [11,] " Organization"    "republican "           
## [12,] "     Person"      "ashley judd"           
## [13,] " topicname"       "politics"              
## [14,] "     URL"         "www.huffingtonpost.com"
## [15,] " Company"         "usa today"             
## [16,] " Person"          "chuck todd"            
## [17,] " Company"         "msnbc"                 
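
The matrix above still carries the stray spaces left over from the comma split, and it has no `ID` column. One possible way to finish it off is sketched below, assuming a small hand-built `DF` with two of the question's rows so the block runs on its own; the names `m` and `res` are made up here.

```r
# Assumed input: two rows from the question, rebuilt by hand
DF <- data.frame(
  ID = c(1, 4),
  Semantic.Tags = c(
    "Person:mitch mcconnell, Person:ashley judd, Position:senator",
    "Person:ashley judd, topicname:politics "
  ),
  stringsAsFactors = FALSE
)

ll <- lapply(strsplit(DF$Semantic.Tags, ","), strsplit, split = ":")
f  <- function(x) do.call(rbind, x)
m  <- f(lapply(ll, f))

# Strip leading/trailing whitespace; assigning into m[] keeps the matrix shape
m[] <- trimws(m)

# sapply(ll, length) is the number of tags per row, used to repeat each ID
res <- data.frame(ID = rep(DF$ID, sapply(ll, length)), m, stringsAsFactors = FALSE)
names(res)[-1] <- c("Tag Type", "Tag")
res
```

`trimws` needs R 3.2.0 or later; on older versions the `gsub("^\\s+|\\s+$", "", x)` idiom does the same job.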
CHP