R: Split multiple rows into a list element based on pattern

Question

I'm trying to parse this .txt file in R: https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt

It's essentially a single-column data frame of roughly 2 million rows, with each entity described by multiple rows and bookended by rows containing the string "//".

Ideally, I could capture each entity, made up of multiple rows, as a list element by splitting at "//", but I'm not sure of the most efficient way to go about this.

Any help is much appreciated.

EDIT:

Here's a snippet of what I'm working with:

[87] "//"                                                                                                                                                                                             
 [88] "ID   #40a"                                                                                                                                                                                      
 [89] "AC   CVCL_IW91"                                                                                                                                                                                 
 [90] "DR   Wikidata; Q54422071"                                                                                                                                                                       
 [91] "RX   PubMed=28159921;"                                                                                                                                                                          
 [92] "CC   Characteristics: Established from parent cell line after two passages in the peritoneal cavity of C57BL/6 mice (PubMed=28159921)."                                                         
 [93] "CC   Transformant: ChEBI; CHEBI:46666; Crocidolite asbestos."                                                                                                                                   
 [94] "CC   Derived from metastatic site: Peritoneum."                                                                                                                                                 
 [95] "CC   Breed/subspecies: C57BL/6."                                                                                                                                                                
 [96] "DI   NCIt; C21619; Mouse mesothelioma"                                                                                                                                                          
 [97] "OX   NCBI_TaxID=10090; ! Mus musculus"                                                                                                                                                          
 [98] "HI   CVCL_IW90 ! 40"                                                                                                                                                                            
 [99] "SX   Male"                                                                                                                                                                                      
[100] "AG   1-2M"                                                                                                                                                                                      
[101] "CA   Cancer cell line"                                                                                                                                                                          
[102] "DT   Created: 15-05-17; Last updated: 02-07-20; Version: 3"                                                                                                                                     
[103] "//"                                                                                                                                                                                             
[104] "ID   #490"                                                                                                                                                                                      
[105] "AC   CVCL_B375"                                                                                                                                                                                 
[106] "SY   490; Mab 7; Mab7"                                                                                                                                                                          
[107] "DR   CLO; CLO_0001018"                                                                                                                                                                          
[108] "DR   ATCC; HB-12029"                                                                                                                                                                            
[109] "DR   Wikidata; Q54422073"                                                                                                                                                                       
[110] "RX   Patent=US5616470;"                                                                                                                                                                         
[111] "CC   Monoclonal antibody isotype: IgM, kappa."                                                                                                                                                  
[112] "CC   Monoclonal antibody target: Cronartium ribicola antigens."                                                                                                                                 
[113] "OX   NCBI_TaxID=10090; ! Mus musculus"                                                                                                                                                          
[114] "HI   CVCL_4032 ! P3X63Ag8.653"                                                                                                                                                                  
[115] "CA   Hybridoma"                                                                                                                                                                                 
[116] "DT   Created: 06-06-12; Last updated: 12-03-20; Version: 6"                                                                                                                                     
[117] "//"                                                                                                                                                                                             
[118] "ID   #822"                                                                                                                                                                                      
[119] "AC   CVCL_X345"                                                                                                                                                                                 
[120] "SY   822; Mab 13; Mab13"                                                                                                                                                                        
[121] "DR   ATCC; HB-12030"                                                                                                                                                                            
[122] "DR   Wikidata; Q54422076"                                                                                                                                                                       
[123] "RX   Patent=US5616470;"                                                                                                                                                                         
[124] "CC   Monoclonal antibody isotype: IgM, kappa."                                                                                                                                                  
[125] "CC   Monoclonal antibody target: Cronartium ribicola antigens."                                                                                                                                 
[126] "OX   NCBI_TaxID=10090; ! Mus musculus"                                                                                                                                                          
[127] "HI   CVCL_4032 ! P3X63Ag8.653"                                                                                                                                                                  
[128] "CA   Hybridoma"                                                                                                                                                                                 
[129] "DT   Created: 17-07-14; Last updated: 12-03-20; Version: 5"                                                                                                                                     
[130] "//"

As an added clarification, my goal is to search for a given accession (AC), e.g. CVCL_X345, and then extract age (AG) and sex (SX) for that accession if they are available.

Maybe you should provide us a few rows of this data frame so we can understand the pattern. — GuedesBF, Jun 01 '21 at 18:00
Do you have the file on your PC, or are you reading it directly from the site? — Onyambu, Jun 01 '21 at 18:37
Sorry, had another question opened which was tagged with python. — Bart Kiers, Jun 01 '21 at 18:44
It looks like `readLines(...)` is what you need: https://stackoverflow.com/questions/12626637/read-a-text-file-in-r-line-by-line — Bart Kiers, Jun 01 '21 at 19:13
Right, I am using that to get the data into R. I'm wondering about an efficient way to group entities by splitting on a character from there... — Rebecca Eliscu, Jun 01 '21 at 19:38

user12728748 · Accepted Answer · 2021-06-02T11:06:28.077

Here is one solution using data.table.

library(data.table)
dt <- fread("https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt", 
            skip=54, header=FALSE, sep='')
dt[, c("code", "content"):=tstrsplit(sub(" +", "@/@", V1), "@/@") ][, 
  `:=` (V1=NULL, ID=cumsum(code=="//")+1)]
dt <- dt[code!="//"]
dt[dt[content=="CVCL_IW91"], on="ID"][code %chin% c("SX", "AG")]
#>    code content ID i.code i.content
#> 1:   SX    Male  3     AC CVCL_IW91
#> 2:   AG    1-2M  3     AC CVCL_IW91

# or get all of them:
dcast(dt[code %in% c("SX", "AG", "AC")][, .(code, content), by=ID], ID ~ ...,
      value.var="content")
#>             ID        AC              AG     SX
#>      1:      1 CVCL_E548 Age unspecified Female
#>      2:      2 CVCL_KA96            <NA>   <NA>
#>      3:      3 CVCL_IW91            1-2M   Male
#>      4:      4 CVCL_B375            <NA>   <NA>
#>      5:      5 CVCL_X345            <NA>   <NA>
#>     ---                                        
#> 128802: 128802 CVCL_A6IX             29Y   Male
#> 128803: 128803 CVCL_ZB29             57Y Female
#> 128804: 128804 CVCL_ZB30             32Y Female
#> 128805: 128805 CVCL_A3ZF             26Y Female
#> 128806: 128806 CVCL_3449            <NA>   Male

Created on 2021-06-01 by the reprex package (v2.0.0)

Edit: Brief explanation:

In essence, I split each line at its first run of blanks. To do that, I replace that run with a marker string ("@/@") that does not occur anywhere in the text (checked beforehand with grep), then use tstrsplit to split the column V1 on this marker into two columns (code and content). Then I remove V1 and use cumsum to increment the identifier ID at every separator line (//), so that each record is labelled with its own identifier.
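The same steps can be traced one at a time on a few lines. This is only a sketch: the toy vector is cut down from the snippet in the question, and the names `lines`, `marked`, `parts`, `hit` and `res` are made up for illustration. The lookup also filters on `code == "AC"` explicitly, which is a small variation on the join used above.

```r
library(data.table)

# Toy input (assumption): three shortened records, each closed by "//"
lines <- c(
  "ID   #40a", "AC   CVCL_IW91", "SX   Male", "AG   1-2M", "//",
  "ID   #490", "AC   CVCL_B375", "CA   Hybridoma", "//",
  "ID   #822", "AC   CVCL_X345", "//"
)
dt <- data.table(V1 = lines)

# Step 1: sub() replaces only the first run of blanks with the marker "@/@";
# a "//" line contains no blank and is left unchanged
marked <- sub(" +", "@/@", dt$V1)

# Step 2: tstrsplit() splits on the marker and returns one vector per piece;
# lines without the marker get NA in the second piece
parts <- tstrsplit(marked, "@/@", fixed = TRUE)
dt[, c("code", "content") := parts]

# Step 3: the running count of "//" lines, plus one, numbers the records
dt[, ID := cumsum(code == "//") + 1]

# Step 4: drop the helper column and the separator lines
dt[, V1 := NULL]
dt <- dt[code != "//"]

# Look up one accession, then pull sex (SX) and age (AG) for its record
hit <- dt[code == "AC" & content == "CVCL_IW91", ID]
res <- dt[ID %in% hit & code %chin% c("SX", "AG")]
print(res)
```

A "//" line itself receives the ID of the record that follows it, which does no harm because those lines are removed in step 4.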

This is so great - thank you! Any chance you could break down what's going on in the third line? — Rebecca Eliscu, Jun 02 '21 at 06:09
