Fruits
john bought banana and kept 7 days from 15 apr 2015
marker bought apple and kept 10 days from 11 jan 2015
shannon bought apple, banana and kept 12 days from 11 feb 2015
mckinsey bought banana and kept 19 days from 11 dec 2015
george bought banana and kept 17 days from 11 feb 2015
mesa bought banana and kept 10 days from 11 jan 2015
mac bought banana and kept 7 days from 11 sep 2015
henric didn’t buy the fruit

This is the content of a column in my data frame. I want to extract the date (day, month and year, e.g. "11 jan 2015") and store it in another column. Then I want to extract the number of days (e.g. "19 days") and store it in another column of the same data frame.

This is what I have tried so far:

date <- gsub("[^0-9]", " ", dataframe$fruits)# wrong

But the code doesn't seem to be right. Can anyone help me, please? Thanks in advance.

Raj
  • 1
    So your data.frame contains 8 rows and 1 column (`Fruits`), and each value is a string of characters? – Adam Quek May 08 '17 at 03:43

2 Answers


We can do this with str_extract. The 'Date' pattern matches two digits ([0-9]{2}), followed by one or more spaces (\\s+), three letters ([A-Za-z]{3}), one or more spaces again, and four digits ([0-9]{4}) at the end ($) of the string. The 'Days' pattern matches one or more digits (\\d+), followed by zero or more spaces (\\s*) and the string 'days'. The letter class is written as [A-Za-z] because the shorter [A-z] would also match the few punctuation characters that sit between 'Z' and 'a' in ASCII.

library(stringr)
df1$Date <- str_extract(df1$Fruits, "[0-9]{2}\\s+[A-Za-z]{3}\\s+[0-9]{4}$")
df1$Days <- str_extract(df1$Fruits, "\\d+\\s*days")
df1
                                                          #Fruits        Date    Days
#1            john bought banana and kept 7 days from 15 apr 2015 15 apr 2015  7 days
#2          marker bought apple and kept 10 days from 11 jan 2015 11 jan 2015 10 days
#3 shannon bought apple, banana and kept 12 days from 11 feb 2015 11 feb 2015 12 days
#4       mckinsey bought banana and kept 19 days from 11 dec 2015 11 dec 2015 19 days
#5         george bought banana and kept 17 days from 11 feb 2015 11 feb 2015 17 days
#6           mesa bought banana and kept 10 days from 11 jan 2015 11 jan 2015 10 days
#7             mac bought banana and kept 7 days from 11 sep 2015 11 sep 2015  7 days
#8                                    henric didn’t buy the fruit        <NA>    <NA>
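
If you would rather not load stringr, roughly the same extraction can be sketched in base R with regexpr and regmatches. This is only a minimal sketch: `extract_first` is a hypothetical helper name made up for illustration, and the vector `fruits` is a cut-down copy of the sample data. The lookahead in the 'Days' pattern is an assumption about what is wanted, namely the bare number rather than the "7 days" string.

```r
# Three of the sample strings, including the one that does not match
fruits <- c("john bought banana and kept 7 days from 15 apr 2015",
            "mckinsey bought banana and kept 19 days from 11 dec 2015",
            "henric didn't buy the fruit")

# Hypothetical helper: first match of `pattern` in each string, NA if none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches returns matched strings only
  out
}

date_chr <- extract_first(fruits, "[0-9]{2}\\s+[A-Za-z]{3}\\s+[0-9]{4}$")

# The lookahead (?=\\s*days) requires 'days' to follow but keeps only the digits
days_chr <- extract_first(fruits, "\\d+(?=\\s*days)")
days <- as.integer(days_chr)
```

Here `days` is an integer vector and `date_chr` is still character; the row without a date or a day count ends up as NA in both, just like with str_extract.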

data

 df1 <- structure(list(Fruits = c("john bought banana and kept 7 days from 15 apr 2015", 
"marker bought apple and kept 10 days from 11 jan 2015", "shannon bought apple, banana and kept 12 days from 11 feb 2015", 
"mckinsey bought banana and kept 19 days from 11 dec 2015", "george bought banana and kept 17 days from 11 feb 2015", 
"mesa bought banana and kept 10 days from 11 jan 2015", "mac bought banana and kept 7 days from 11 sep 2015", 
"henric didn’t buy the fruit")), .Names = "Fruits", class = "data.frame", row.names = c(NA, 
-8L))
akrun
  • @Raj If you don't understand how the regex here works, I would recommend trying a regex testing site like regex101; it does a pretty good job with [this answer](https://regex101.com/r/gYMNZB/1) – Marius May 08 '17 at 03:49
  • 2
    @Raj I created a reproducible example with the 'data'. It is working for me. Please check the `str(yourdata)` to see if it is a `matrix` or not. If it is `matrix`, then use `df1[,1]` – akrun May 08 '17 at 03:55
  • 1
    Similarly, if it isn't working with the data that @akrun had to create to mimic what was inferred from your question, then it would be beneficial for you to provide a [reproducible question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including a sufficient subset of your data to be able to reproduce it accurately. – r2evans May 08 '17 at 03:57

You can separate everything with strsplit and then reassemble:

df <- read.csv2(text = 'Fruits
john bought banana and kept 7 days from 15 apr 2015
marker bought apple and kept 10 days from 11 jan 2015
shannon bought apple, banana and kept 12 days from 11 feb 2015
mckinsey bought banana and kept 19 days from 11 dec 2015
george bought banana and kept 17 days from 11 feb 2015
mesa bought banana and kept 10 days from 11 jan 2015
mac bought banana and kept 7 days from 11 sep 2015
henric didn’t buy the fruit')

split_text <- strsplit(as.character(df$Fruits), ' bought | and kept | days from ')

df2 <- data.frame(do.call(rbind, split_text[lengths(split_text) == 4]), stringsAsFactors = FALSE)
names(df2) <- c('name', 'fruit', 'days', 'date')

df2$days <- as.integer(df2$days)
df2$date <- as.Date(df2$date, '%d %b %Y')

df2
#>       name         fruit days       date
#> 1     john        banana    7 2015-04-15
#> 2   marker         apple   10 2015-01-11
#> 3  shannon apple, banana   12 2015-02-11
#> 4 mckinsey        banana   19 2015-12-11
#> 5   george        banana   17 2015-02-11
#> 6     mesa        banana   10 2015-01-11
#> 7      mac        banana    7 2015-09-11

Note that you have to drop the last observation (this is what `lengths(split_text) == 4` does), as it doesn't match the pattern.
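
If the non-matching rows should stay in the result instead of being dropped, one possible variation is to capture the four fields with regexec and fill in NA where there is no match. This is a sketch under assumptions, not part of the approach above: the single `pattern` with four capture groups and the helper `to_fields` are made up for illustration, and `fruits` is a shortened copy of the sample data.

```r
fruits <- c("john bought banana and kept 7 days from 15 apr 2015",
            "shannon bought apple, banana and kept 12 days from 11 feb 2015",
            "henric didn't buy the fruit")

# One capture group per field; strings that do not fit yield no match
pattern <- "^(.*) bought (.*) and kept ([0-9]+) days from (.*)$"
matches <- regmatches(fruits, regexec(pattern, fruits))

# Hypothetical helper: drop the full match, or return four NAs if nothing matched
to_fields <- function(m) if (length(m) == 5) m[-1] else rep(NA_character_, 4)

parsed <- data.frame(do.call(rbind, lapply(matches, to_fields)),
                     stringsAsFactors = FALSE)
names(parsed) <- c("name", "fruit", "days", "date")
parsed$days <- as.integer(parsed$days)
```

The result has one row per input string, so it can be bound back to the original column with cbind. The trade-off is that the unmatched row is NA in every field, including the name.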

alistaire