I have multiple files, each with a different title, and I want to extract the title from each file. Here is an example of one file:

[1] "<START"                        "ID=\"CMP-001\""                  "NO=\"1\">"                         
[4] "<NAME>Plasma-derived"          "vaccine"                         "(PDV)"                             
[7] "versus"                        "placebo"                         "by"                                
[10] "intramuscular"                "route</NAME>"                    "<DIC"                     
[13] "CHI2=\"3.6385\""              "CI_END=\"0.6042\""               "CI_START=\"0.3425\""   
[16] "CI_STUDY=\"95\""                "CI_TOTAL=\"95\""               "DF=\"3.0\""                        
[19] "TOTAL_1=\"0.6648\""           "TOTAL_2=\"0.50487622\""           "BLE=\"YES\"" 
.
.
.
 [789] "TOTAL_2=\"39\""             "WEIGHT=\"300.0\""              "Z=\"1.5443\">"    
 [792] "<NAME>Local"                "adverse"                       "events" 
 [795] "after"                      "each"                          "injection"
 [798] "of"                         "vaccine</NAME>"               "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>"
 [801] "</GROUP_LABEL_2>"           "<GRAPH_LABEL_1>"              "PDV</GRAPH_LABEL_1>"

The expected extracted title is:

Plasma-derived vaccine (PDV) versus placebo by intramuscular route

Note: the length of the title differs from file to file.

Me28
    [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a workable sample of data, all necessary code, and a clear explanation of what you're trying to do and what hasn't worked. If you're working with text that represents XML, your safest bet is probably using functions that are specifically designed for that purpose rather than regex – camille Nov 07 '19 at 15:39

1 Answer

Here is a solution using stringr. It first collapses the vector into one long string, then captures the text between every pair of "<NAME>" and "</NAME>". The lazy pattern .*? matches as few characters as possible (any character except a newline \n), so each title stops at its own closing tag. In the future, people will be able to help you more easily if you provide a reproducible example (e.g., using dput()). Hope this helps!

Note: if you just want the first title, you can use str_match() instead of str_match_all(). str_match() returns a matrix rather than a list, so drop the [[1]] and keep the [,2].

library(stringr)

str_match_all(paste0(string, collapse = " "), "<NAME>(.*?)</NAME>")[[1]][,2]
[1] "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
[2] "Local adverse events after each injection of vaccine" 

Data:

string <- c("<START", "ID=\"CMP-001\"", "NO=\"1\">", "<NAME>Plasma-derived", "vaccine", "(PDV)", "versus", "placebo", "by", "intramuscular", "route</NAME>", "<DIC", "CHI2=\"3.6385\"", "CI_END=\"0.6042\"", "CI_START=\"0.3425\"", "CI_STUDY=\"95\"", "CI_TOTAL=\"95\"", "DF=\"3.0\"", "TOTAL_1=\"0.6648\"", "TOTAL_2=\"0.50487622\"", "BLE=\"YES\"",
            "TOTAL_2=\"39\"", "WEIGHT=\"300.0\"", "Z=\"1.5443\">", "<NAME>Local", "adverse", "events", "after", "each", "injection", "of", "vaccine</NAME>", "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>", "</GROUP_LABEL_2>", "<GRAPH_LABEL_1>", "PDV</GRAPH_LABEL_1>")
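
Since the question mentions multiple files, here is a minimal sketch of how the same idea might be applied to each file in turn. It uses base R (regexec() and regmatches()) rather than stringr, purely to avoid the package dependency. The helper name extract_first_title and the two short token vectors are made up for illustration; how the real files are read into character vectors is not shown in the question, so that step is assumed to have happened already.

```r
# Hypothetical helper (not from the original answer): returns the first
# <NAME>...</NAME> title found in a character vector of tokens, or NA if none.
extract_first_title <- function(tokens) {
  # Collapse the tokens into one string, as in the stringr solution
  collapsed <- paste(tokens, collapse = " ")
  # regexec() reports the full match plus the capture group;
  # regmatches() turns those positions into the matched text
  m <- regmatches(collapsed,
                  regexec("<NAME>(.*?)</NAME>", collapsed, perl = TRUE))[[1]]
  if (length(m) >= 2) m[2] else NA_character_
}

# Made-up stand-ins for the contents of two files (assumption: each file
# has already been read into a character vector of tokens)
files <- list(
  a = c("<START", "<NAME>Plasma-derived", "vaccine</NAME>", "<DIC"),
  b = c("<START", "NO=\"2\">", "<DIC")
)

# One title per file; files without a <NAME> element give NA
titles <- vapply(files, extract_first_title, character(1))
```

Here titles is a named character vector with one entry per file, so a file with no title shows up as NA instead of silently disappearing from the result.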
Andrew