How to match multi-row data using regular expressions in R

Question

I import a txt document into R using readLines, but the document is turned into a character vector, i.e. every element of the vector holds one line of the txt document, so I cannot use a regular expression to match data that spans multiple rows. How can I solve this problem?

Example document, test.txt:

ID   cel-let-7         standard; RNA; CEL; 99 BP.

XX

AC   MI0000001;

XX
DE   Caenorhabditis elegans let-7 stem-loop

XX

RN   [1]

RX   PUBMED; 11679671.

RA   Lau NC, Lim LP, Weinstein EG, Bartel DP;

RT   "An abundant class of tiny RNAs with probable regulatory roles in

RT   Caenorhabditis elegans";

RL   Science. 294:858-862(2001).

I need the data between ID and DE, but the code below doesn't work because there is no way to match across multiple rows.

pattern <- 'ID.+\nXX\nAC.+\nXX'
m <- gregexpr(pattern, text, perl = TRUE)

Perhaps there is another method, but I only want to solve this using regular expressions.

I'm not sure what's your desired output, but you could try something like that I guess `indx <- grep("^(ID|DE)", text) ; paste(text[indx[1]:(indx[2] - 1)], collapse = " ")` — David Arenburg, Dec 30 '14 at 12:39
Have a look at http://stackoverflow.com/questions/14261776/extracting-data-from-text-files which may help — DarrenRhodes, Dec 30 '14 at 12:41

score 0 · Accepted Answer · answered Dec 30 '14 at 12:40

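The commands below fetch the lines between ID and DE. The lines are first collapsed into a single newline-separated string so that one regex can span several rows: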

> f <- paste(readLines("file"), collapse="\n")
> m <- gregexpr("(?m)^ID.*\\n\\K[\\S\\s]*?(?=\\nDE)", f, perl=TRUE)
> regmatches(f, m)
[[1]]
[1] "\nXX\n\nAC   MI0000001;\n\nXX"

OR, to include the ID line and the DE marker in the match:

> m <- gregexpr("(?s)^ID.*?\\nDE", f, perl=TRUE)
> regmatches(f, m)
[[1]]
[1] "ID   cel-let-7         standard; RNA; CEL; 99 BP.\n\nXX\n\nAC   MI0000001;\n\nXX\nDE"
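For a version that runs without a file on disk, here is a minimal sketch. It assumes a character vector `lines` standing in for the result of readLines (the blank lines of test.txt are left out for brevity), and the name `between` is only illustrative. It reuses the first pattern and then splits the match back into one element per row:

```r
# Stand-in for readLines("test.txt"): one element per line (assumed data).
lines <- c(
  "ID   cel-let-7         standard; RNA; CEL; 99 BP.",
  "XX",
  "AC   MI0000001;",
  "XX",
  "DE   Caenorhabditis elegans let-7 stem-loop"
)

# Join the lines so the regex engine can see the newlines between rows.
f <- paste(lines, collapse = "\n")

# Same pattern as the first command: skip the ID line with \K,
# then lazily take everything up to (not including) "\nDE".
m <- regexpr("(?m)^ID.*\\n\\K[\\S\\s]*?(?=\\nDE)", f, perl = TRUE)
between <- regmatches(f, m)

# Turn the multi-row match back into a character vector, one row each.
rows <- strsplit(between, "\n", fixed = TRUE)[[1]]
print(rows)
```

Here regexpr is used instead of gregexpr because only the first ID...DE block is wanted; for a file with several records, gregexpr would return every block.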

answered Dec 30 '14 at 12:40

Avinash Raj


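Firstly, thanks for your answer, which solves my problem. In addition, I am puzzled about the **(?m)** and **(?s)** in the patterns @Avinash Raj. – Kang Li Jan 03 '15 at 09:43
**(?m)** is the multiline modifier: it makes ^ and $ match at the start and end of every line rather than only at the start and end of the whole string. **(?s)** is the dotall modifier: it makes . match newline characters as well. – Avinash Raj Jan 03 '15 at 10:46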
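To see what each modifier changes, here is a small sketch on a made-up three-line string `x` (an assumption for illustration, not taken from the question's file). The trailing comments give the value each call returns:

```r
# Hypothetical three-row string: rows are separated by "\n".
x <- "ID a\nXX\nDE b"

# Without (?m), ^ matches only at the very start of the string.
grepl("^XX", x, perl = TRUE)         # FALSE
# With (?m), ^ also matches right after each newline.
grepl("(?m)^XX", x, perl = TRUE)     # TRUE

# Without (?s), . does not match a newline, so ID and DE cannot be joined.
grepl("ID.*DE", x, perl = TRUE)      # FALSE
# With (?s), . matches newlines too, so the match can span rows.
grepl("(?s)ID.*DE", x, perl = TRUE)  # TRUE
```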
