1

I have a csv file which contains data in the following format:

PrjID, Objective
1001 , (i) To improve efficiency (ii) Decrease cost (iii) Maximize revenue
1002 , a) Have fun b) Learn new things
1003 , (1) Getting tricky (2) Challanging task

First variable is an Id and the second variable is a text variable "objective". Each project has data on multiple objectives in a single column seperate by (i), (ii), ..etc or (a), (b), (c),..etc, or (1), (2), (3), ..etc. Now I want an observation created for each objective of the projects. Much like this:

PrjID, Objective
1001 , (i) To improve efficiency
1001 , (ii) Decrease cost
1001 , (iii) Maximize revenue
1002 , a) Have fun
1002 , b) Learn new things
1003 , (1) Getting tricky
1003 , (2) Challanging task

For the projects that have just one objective, it has just one row. But for multiple objectives it splits up the observation.

I am quite new to handling text data in R, can some R pro help me get started with this problem? Thanks in advance!

Shreyes
  • 3,601
  • 1
  • 17
  • 16

2 Answers2

4

Here one idea.

  1. Insert a new delimiter in your Objective column, using a clever regular expression
  2. Use this delimiter within strsplit to split the sentence in a vector
  3. Using by , to process the previous steps by ID.

Following this steps , I get this code:

ll <- by(dat,dat$PrjID,FUN = function(x){
        x.delim <- gsub(" (\\(?[a-x,0-9]*\\))",'#\\1',x$Objective)
        obj  = unlist(strsplit(x.delim,'#'))
        data.frame(PrjID= x$PrjID,objective=obj[-1])
})
## transform your list to a data.frame
do.call(rbind,ll)

      PrjID                 objective
1001.1  1001 (i) To improve efficiency
1001.2  1001        (ii) Decrease cost
1001.3  1001   (iii) Maximize revenue 
1002.1  1002               a) Have fun
1002.2  1002      b) Learn new things 
1003.1  1003        (1) Getting tricky
1003.2  1003      (2) Challanging task

PS , here dat is :

dat <- read.table(text='PrjID, Objective 
1001 , (i) To improve efficiency (ii) Decrease cost (iii) Maximize revenue 
1002 , a) Have fun b) Learn new things 
1003 , (1) Getting tricky (2) Challanging task',sep=',',header=TRUE)
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • Sweet, awesome approach! The only challenge left to tackle is the cases where we have genuine brackets. Meaning if I have an objective like "(i) Learn stuff (and also do awesome stuff)". Any take on that? – Shreyes May 21 '13 at 14:13
  • 1
    if your items are just **a,b,c and sequence of i** you can change the regular expression to something like `(\\(?[a,b,c,i,0-9]{1-4}\\))`... – agstudy May 21 '13 at 14:16
3

Taking a leaf from the answer by agstudy, here is a solution which does not use a magic delimiter but does not preserve the indices of the points in the text:

// Matches:
// 1. Single letter prefixes: a), b) ... z)
// 2. Roman numerals (only small case): [i,x,c,m,v]+
// 3. Numeral indexes: [0-9]*
delim <- "((^|\\s)\\(?([a-z]|[i,x,c,m,v]+|[0-9]+)\\))"

ll <- by(dat, dat$PrjID, function (r) {
            each.obj <- str_split(r$Objective, delim)[[1]][-1]
            data.frame(PrjId = r$PrjID, Objective = str_trim(each.obj))
        })

do.call(rbind, ll)

       PrjId                     Objective
1001.1  1001     First(could be something)
1001.2  1001 Seconds (blah something else)
1001.3  1001      (how can thins be) Third
1002.1  1002         To improve efficiency
1002.2  1002                 Decrease cost
1002.3  1002              Maximize revenue
1003.1  1003                Getting tricky
1003.2  1003              Challanging task

dat in this case is:

> dat
  PrjID
1  1001
2  1002
3  1003
                                                                                    Objective
1 (i) First(could be something) b) Seconds (blah something else) (3) (how can thins be) Third
2                        (i) To improve efficiency (ii) Decrease cost (iii) Maximize revenue
3                                                     (1) Getting tricky (2) Challanging task 
musically_ut
  • 34,028
  • 8
  • 94
  • 106