I'm trying to parse this .txt file in R: https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt
Once read in, it's essentially a single-column data frame of roughly 2 million rows, where each entity is described by multiple rows and is bookended by rows containing the string "//".
Ideally, I could capture each entity, made up of multiple rows, as a list element by splitting at "//", but I'm not sure of the most efficient way to go about this.
Any help is much appreciated.
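To make the goal concrete, here is a minimal sketch of the kind of split I have in mind. It assumes the file has already been read into a character vector (e.g. with `readLines()`) and that every entity ends with a "//" line. The toy `lines` vector and the names `is_end`, `record_id` and `records` are made up for illustration. Any header lines before the first entity would end up in the first list element and would need to be dropped separately.

```r
# Toy stand-in for readLines("cellosaurus.txt"); the values are invented
lines <- c("ID   A", "AC   X1", "//",
           "ID   B", "AC   X2", "SX   Male", "//")

# TRUE for the lines that terminate an entity
is_end <- lines == "//"

# Entity index for each line: number of terminators seen before it, plus one
record_id <- cumsum(c(FALSE, head(is_end, -1))) + 1L

# Drop the terminator lines and group the remaining lines by entity
records <- split(lines[!is_end], record_id[!is_end])

length(records)  # 2
```

Everything here is vectorised (`cumsum()` plus a single `split()`), so I would hope it scales to the full file, but I have not timed it on all ~2 million rows.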
EDIT:
Here's a snippet of what I'm working with:
[87] "//"
[88] "ID #40a"
[89] "AC CVCL_IW91"
[90] "DR Wikidata; Q54422071"
[91] "RX PubMed=28159921;"
[92] "CC Characteristics: Established from parent cell line after two passages in the peritoneal cavity of C57BL/6 mice (PubMed=28159921)."
[93] "CC Transformant: ChEBI; CHEBI:46666; Crocidolite asbestos."
[94] "CC Derived from metastatic site: Peritoneum."
[95] "CC Breed/subspecies: C57BL/6."
[96] "DI NCIt; C21619; Mouse mesothelioma"
[97] "OX NCBI_TaxID=10090; ! Mus musculus"
[98] "HI CVCL_IW90 ! 40"
[99] "SX Male"
[100] "AG 1-2M"
[101] "CA Cancer cell line"
[102] "DT Created: 15-05-17; Last updated: 02-07-20; Version: 3"
[103] "//"
[104] "ID #490"
[105] "AC CVCL_B375"
[106] "SY 490; Mab 7; Mab7"
[107] "DR CLO; CLO_0001018"
[108] "DR ATCC; HB-12029"
[109] "DR Wikidata; Q54422073"
[110] "RX Patent=US5616470;"
[111] "CC Monoclonal antibody isotype: IgM, kappa."
[112] "CC Monoclonal antibody target: Cronartium ribicola antigens."
[113] "OX NCBI_TaxID=10090; ! Mus musculus"
[114] "HI CVCL_4032 ! P3X63Ag8.653"
[115] "CA Hybridoma"
[116] "DT Created: 06-06-12; Last updated: 12-03-20; Version: 6"
[117] "//"
[118] "ID #822"
[119] "AC CVCL_X345"
[120] "SY 822; Mab 13; Mab13"
[121] "DR ATCC; HB-12030"
[122] "DR Wikidata; Q54422076"
[123] "RX Patent=US5616470;"
[124] "CC Monoclonal antibody isotype: IgM, kappa."
[125] "CC Monoclonal antibody target: Cronartium ribicola antigens."
[126] "OX NCBI_TaxID=10090; ! Mus musculus"
[127] "HI CVCL_4032 ! P3X63Ag8.653"
[128] "CA Hybridoma"
[129] "DT Created: 17-07-14; Last updated: 12-03-20; Version: 5"
[130] "//"
As an added clarification, my goal is to search for a given accession (AC), e.g. CVCL_X345, and then extract age (AG) and sex (SX) for that accession if they are available.
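Roughly, the lookup I am after would behave like the sketch below. It uses a few lines copied from the snippet above, and the helper names `get_field()` and `lookup_accession()` are hypothetical, not from any package. It assumes each line starts with a two-letter code followed by whitespace, and that each entity has a single AC line holding one accession.

```r
# A few lines taken from the snippet above (trimmed for brevity)
lines <- c(
  "//",
  "ID #40a", "AC CVCL_IW91", "SX Male", "AG 1-2M", "CA Cancer cell line",
  "//",
  "ID #822", "AC CVCL_X345", "CA Hybridoma",
  "//"
)

# Group lines into entities: the counter increases at every "//" line
is_sep  <- lines == "//"
records <- split(lines[!is_sep], cumsum(is_sep)[!is_sep])

# Hypothetical helper: value of the first line with the given code, or NA
get_field <- function(rec, code) {
  pattern <- paste0("^", code, "\\s+")
  hit <- grep(pattern, rec, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub(pattern, "", hit[1])
}

# Hypothetical helper: age (AG) and sex (SX) for one accession,
# or NULL if the accession is not present
lookup_accession <- function(records, accession) {
  ac <- vapply(records, get_field, character(1), code = "AC")
  i <- match(accession, ac)
  if (is.na(i)) return(NULL)
  rec <- records[[i]]
  c(AG = get_field(rec, "AG"), SX = get_field(rec, "SX"))
}

lookup_accession(records, "CVCL_IW91")  # AG "1-2M", SX "Male"
lookup_accession(records, "CVCL_X345")  # both NA: no AG or SX lines
```

For repeated lookups it would probably make sense to compute the accession vector once and reuse it, rather than rebuilding it on every call as this sketch does.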