
I have a column with a lot of text. I just want to retain the text between [start section id="20107"] and [end section id="20107"]; the rest is not important.

Here is the original data:

[start section id="20106"]

California, Death Valley 

[end section id="20106"]

[start section id="20107"]

1. Apple
2. Orange
3. Bannana
4. Kiwi
5. Grapes
6. Strawberry

[end section id="20107"]


[start section id="20108"]

Jose has worked on these farms , currently he is in Florida picking tomatos

[end section id="20108"]

What I am trying to do is retain only the text between [start section id="20107"] and [end section id="20107"], inclusive:

[start section id="20107"]

1. Apple
2. Orange
3. Bannana
4. Kiwi
5. Grapes
6. Strawberry

[end section id="20107"]

Any help on this topic is much appreciated.

  • What have you tried yourself? Here's how to create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Heroka Sep 18 '15 at 17:38
  • @Heroka, I have tried `testdf = filter(org_df, grepl('[start section id="20107"]|[end section id="20107"]', col1))`, but I am not getting the right results: it shows the original column and does not get rid of the text outside these start and end conditions – Bridgeport Byron Tucker Sep 18 '15 at 17:44
  • Please provide some sample data and add what you've tried to your question. Note that grepl returns a logical vector saying whether each string contains a match anywhere; it does not change the text. You might need gsub. – Heroka Sep 18 '15 at 17:49
  • @Heroka, cool , did that – Bridgeport Byron Tucker Sep 18 '15 at 18:09
  • Your code example is not in typical R syntax. Use `dput(mydata)` to post the actual code instead of an interpretive text copy. – Pierre L Sep 18 '15 at 18:26
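
To illustrate the point made in the comments, here is a minimal sketch of why the `grepl` attempt cannot trim the text. The vector `x` is made-up sample data, not the asker's actual column. Unescaped square brackets define a character class, and `grepl` only reports whether a match exists.

```r
# Made-up sample data for illustration only.
x <- c('[start section id="20107"] Apple [end section id="20107"] other text',
       'no markers at all')

# Unescaped, [ ... ] is a character class: the pattern matches any single
# character from that set, so both strings match.
unescaped <- grepl('[start section id="20107"]', x)
print(unescaped)  # TRUE TRUE

# With escaped brackets the literal marker is matched, but the result is
# still only TRUE/FALSE per element; the text itself is left unchanged.
escaped <- grepl('\\[start section id="20107"\\]', x)
print(escaped)    # TRUE FALSE
```

Used inside `filter`, such a logical vector can only keep or drop whole rows, which is why the column content stays as it was.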

1 Answer


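You may use `sub` with a Perl-compatible regular expression. The pattern captures the wanted section and replaces the whole string with that captured group: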

x <- '[start section id="20107"]

1. Apple
2. Orange
3. Bannana
4. Kiwi
5. Grapes
6. Strawberry

[end section id="20107"]


[start section id="20108"]

Jose has worked on these farms , currently he is in Florida picking tomatos

[end section id="20108"]'
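# [\s\S] matches any character, including newlines; the lazy *? stops at the first matching end marker.
cat(sub('[\\s\\S]*(\\[start section id="20107"\\][\\s\\S]*?\\[end section id="20107"\\])[\\s\\S]*', '\\1', x, perl=TRUE))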

#[start section id="20107"]

#1. Apple
#2. Orange
#3. Bannana
#4. Kiwi
#5. Grapes
#6. Strawberry

#[end section id="20107"]
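
Since the question is about a data frame column, the same idea might be applied to every row at once. The following is only a sketch under some assumptions: `extract_section` is a hypothetical helper name, the sample `org_df` is invented (the column name `col1` is taken from the asker's comment), and rows without the section are set to `NA`. It uses `regexpr` with `regmatches` rather than `sub`, so a string with no match is not returned unchanged.

```r
# Hypothetical helper: keep only one section from each string in a character vector.
extract_section <- function(text, id) {
  # (?s) lets "." match newlines; brackets are escaped so they match literally.
  pattern <- sprintf(
    '(?s)\\[start section id="%s"\\].*?\\[end section id="%s"\\]', id, id
  )
  m <- regexpr(pattern, text, perl = TRUE)
  # Assumption: rows that do not contain the section become NA.
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# Invented sample data; the real column would hold the full text.
org_df <- data.frame(
  col1 = c(
    paste0('[start section id="20106"]\nCalifornia\n[end section id="20106"]\n',
           '[start section id="20107"]\n1. Apple\n2. Orange\n[end section id="20107"]\n',
           '[start section id="20108"]\nJose\n[end section id="20108"]'),
    'no sections here'
  ),
  stringsAsFactors = FALSE
)

org_df$col1 <- extract_section(org_df$col1, "20107")
cat(org_df$col1[1])
```

The first row then holds just the 20107 section including both markers, and the second row is `NA` because it contains no such section.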
Avinash Raj