Parsing of PMCID table row to column form

Question

dput(t1)
structure(list(PMCID = c("PMC7809753", "PMC7809753", "PMC7809753", 
"PMC7809753", "PMC7809753", "PMC7790830", "PMC7790830", "PMC7790830", 
"PMC7790830", "PMC7790830"), table = c("Table 1", "Table 1", 
"Table 1", "Table 1", "Table 1", "Table 1", "Table 1", "Table 1", 
"Table 1", "Table 1"), row = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 
4L, 5L), text = c("Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK. Inactivation: CDA, dCMPD, PN-I.; Efflux=MRP4,7,8; Refs.=[14, 30–33, 78–80]", 
"Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[44, 51, 81–84]", 
"Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44, 85–90]", 
"Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, 91, 92]", 
"Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutamylation); Efflux=P-gp, MRP1-5, BCRP; Refs.=[16, 93, 94]", 
"Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; Cell count(×109/l): PLT=9; BM Blast (%)=70.5; Karyotype=46,XX,t(8,21)(q22;q22)", 
"Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103; Cell count(×109/l): PLT=62; BM Blast (%)=60.4; Karyotype=46,XX", 
"Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; Cell count(×109/l): PLT=100; BM Blast (%)=88; Karyotype=45,XY,-7", 
"Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; Cell count(×109/l): PLT=52; BM Blast (%)=86.8; Karyotype=46,XY", 
"Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; Cell count(×109/l): PLT=197; BM Blast (%)=32.4; Karyotype=46,XX"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

The above is my sample data frame, which looks like this:

head(t1)
# A tibble: 6 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7809753 Table…     1 Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK.…
2 PMC7809753 Table…     2 Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[…
3 PMC7809753 Table…     3 Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44,…
4 PMC7809753 Table…     4 Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, …
5 PMC7809753 Table…     5 Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutam…
6 PMC7790830 Table…     1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …

For example, the first five rows of the output above come from paper PMC7809753. In the paper, the first table is "Properties of the chemotherapeutic drugs used in AML", shown in the picture I have attached. In my data frame, Table 1 of PMC7809753 is repeated 5 times, which corresponds to that picture.

The issue is: how do I parse each table of a particular PMCID into a tabular, column-like structure as shown in the paper?

UPDATE: I can split the rows into a list by PMCID.

aa <- split(t1, f = t1$PMCID)

which gives me this

$PMC7790830
# A tibble: 5 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7790830 Table…     1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …
2 PMC7790830 Table…     2 Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103…
3 PMC7790830 Table…     3 Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; …
4 PMC7790830 Table…     4 Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; C…
5 PMC7790830 Table…     5 Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; …

$PMC7809753
# A tibble: 5 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7809753 Table…     1 Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK.…
2 PMC7809753 Table…     2 Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[…
3 PMC7809753 Table…     3 Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44,…
4 PMC7809753 Table…     4 Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, …
5 PMC7809753 Table…     5 Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutam…

UPDATE v2

I tried to collapse the rows that share a PMCID into one, based on the solution in "Convert duplicate rows to separate columns in R":

library(splitstackshape)
library(data.table)
DT <- setDT(t1)[, do.call(paste, c(.SD, list(collapse=', '))) , PMCID]
DT1 <- cSplit(DT, 'V1', sep='[ ,]+', fixed=FALSE, stripWhite=TRUE)
setnames(DT1, 2:ncol(DT1), rep(names(t1)[-1], 41))
DT1

So the problem remains the same: how do I separate the rows belonging to each list element into columns, or some tabular form, as shown in the picture?

Do you want to turn `t1` to something similar that you have in the image? The first 5 rows of t1 corresponds to 1st row in the image? — Ronak Shah, Jan 29 '21 at 10:55
yes that is what my objective is .or else it becomes difficult to read the rows. Actually the t1 is a result of that image .Im using this europmc library to parse data from drugs and diseases. So the parsed output is in a tabular form which is t1. — PesKchan, Jan 29 '21 at 11:06
"The first 5 rows of t1 corresponds to 1st row in the image?" yes — PesKchan, Jan 29 '21 at 11:15

Ben · Answer 1 · 2021-01-31T15:06:42.967

I think it may help to use the tidypmc package with your europepmc output. Here is an example of extracting the first table from your PMC article using `pmc_table`. It also uses `map` from purrr, which is part of the tidyverse.

library(tidypmc)
library(tidyverse)
library(europepmc)

doc <- map("PMC7809753", epmc_ftxt)
tbls <- pmc_table(doc[[1]])
tbls[[1]]

Output

# A tibble: 7 x 6
  Drug                Target           Influx            Metabolisma                                 Efflux         Refs.        
  <chr>               <chr>            <chr>             <chr>                                       <chr>          <chr>        
1 Cytarabine (Ara-C)  DNA polymerases  ENT1, CNT3, OCTN1 "Activation: dCK, dCMPK, NDK. Inactivation… MRP4,7,8       [14, 30–33, …
2 Daunorubicin (DNR)  DNA, Topoisomer… Passive diffusion ""                                          P-gp, MRP1,7,… [44, 51, 81–…
3 Mitoxantrone (MX)   DNA, Topoisomer… Passive diffusion ""                                          P-gp, MRP1, B… [44, 85–90]  
4 Etoposide (VP-16)   Topoisomerase II Passive diffusion ""                                          P-gp, MRP1-3,… [16, 91, 92] 
5 Methotrexate (MTX)  DHFR, TS, AICAR… RFC, PCFT         "Aldehyde oxidase, FPGS (polyglutamylation… P-gp, MRP1-5,… [16, 93, 94] 
6 Venetoclax (VEN)    Bcl-2            Passive diffusion ""                                          P-gp           [72, 95]     
7 Gemtuzumab Ozogami… DNA              Ab-mediated endo… "Lysosomal Calicheamicin cleavage from Ab,… P-gp, MRP1     [73, 77]

Edit (1/30/21): To automate this process for multiple articles (and based on your other question and approach), consider the following.

You can keep your PMCIDs in a vector and pass that vector to `map`. This creates `docs`, a list holding the full-text XML of every article in `pmcids`.

Then you can use map again to store all the tables in my_tables, which would be a list.

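# Search Europe PMC for open-access articles matching the query
b <- epmc_search(query = 'cytarabine aml OPEN_ACCESS:Y', limit = 6)
# Keep the PMCIDs of the hits flagged as open access
pmcids <- b$pmcid[b$isOpenAccess == "Y"]
# Fetch the full-text XML of each article
docs <- map(pmcids, epmc_ftxt)
# Extract the tables of each article (one list element per article)
my_tables <- map(docs, pmc_table)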

You can then access, for example, article 2 table 1 by:

my_tables[[2]][[1]]

Edit (1/31/21): To set the names of each article to the PMCID, you can use set_names, and chain using %>% with map. set_names will add names to your vector. When you call this function, but don't provide additional names, it will use the vector elements as the names. For example:

docs <- pmcids %>%
  set_names() %>%
  map(., epmc_ftxt)

You can then call my_tables <- map(docs, pmc_table) separately, or add it to the chain (storing the whole result as my_tables) if you are only interested in the tables and not the full documents.
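To see how the names travel through such a chain without any network access, here is a minimal offline sketch. `fake_fetch` and `fake_tables` are hypothetical stand-ins for `epmc_ftxt` and `pmc_table`, and `pmcids_demo` simply reuses the two IDs from the question; none of these names come from the packages themselves.

```r
library(purrr)
library(magrittr)  # loaded explicitly for %>%

# The two PMCIDs from the question, used as demo input.
pmcids_demo <- c("PMC7809753", "PMC7790830")

# Hypothetical offline stand-ins (assumptions, not real package functions):
# fake_fetch() plays the role of epmc_ftxt(), fake_tables() of pmc_table().
fake_fetch  <- function(id) paste("full text of", id)
fake_tables <- function(doc) list(toupper(doc))

tables_demo <- pmcids_demo %>%
  set_names() %>%       # name each element after its own value
  map(fake_fetch) %>%   # names are carried through map()
  map(fake_tables)

names(tables_demo)
tables_demo[["PMC7809753"]][[1]]
```

With the real functions, the same shape would presumably be `pmcids %>% set_names() %>% map(epmc_ftxt) %>% map(pmc_table)`, stored as `my_tables`.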

Ultimately, you could then access individual tables using the PMCID like this:

my_tables[["PMC7806552"]][[1]]
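If re-fetching the articles is not an option, the `text` column of `t1` can also be parsed directly, since every row is a series of `name=value` pairs separated by "; ". The following is only a sketch under that assumption, using tidyr; `t1_demo` is a shortened, hypothetical stand-in for `t1` and `parse_rows` is a helper name made up for this example.

```r
library(dplyr)
library(tidyr)

# Shortened stand-in for `t1` (two rows adapted from the question's data).
t1_demo <- tibble::tibble(
  PMCID = c("PMC7790830", "PMC7790830"),
  table = "Table 1",
  row   = 1:2,
  text  = c(
    "Patients no.=1; Age (years)=45; Gender=M; Karyotype=46,XX,t(8,21)(q22;q22)",
    "Patients no.=2; Age (years)=41; Gender=F; Karyotype=46,XX"
  )
)

# Hypothetical helper: one "name=value" pair per row, then spread to columns.
parse_rows <- function(df) {
  df %>%
    # Split on "; " (semicolon + space), so "(q22;q22)" stays intact.
    separate_rows(text, sep = "; ") %>%
    # Split each pair at the first "=" only; extra "=" stay in the value.
    separate(text, into = c("key", "value"), sep = "=", extra = "merge") %>%
    pivot_wider(names_from = key, values_from = value)
}

wide_demo <- parse_rows(t1_demo)
wide_demo
```

Because different articles have different column names, it is probably cleaner to apply the helper per article, for example `lapply(split(t1, t1$PMCID), parse_rows)`, rather than to the whole data frame at once. Note that this only recovers the rows already present in `t1`, and all values come back as character.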

edited Jan 31 '21 at 15:06

answered Jan 29 '21 at 23:27


"doc <- map("PMC7809753", epmc_ftxt)", So do i have to run this one by one for each PMCID? – PesKchan Jan 30 '21 at 05:17
You can supply `map` with a list or vector of articles. I'm not sure exactly what your starting point is in terms of PMCIDs, but say you have a vector of articles: `vec <- unique(t1$PMCID)` (this would include both articles in your `t1`). You can also create it by hand: `vec <- c("PMC7809753", "PMC7790830")`. Then you provide the vector to `map` like this: `doc <- map(vec, epmc_ftxt)`. – Ben Jan 30 '21 at 15:50
Here is one of the solutions that was given to me: https://stackoverflow.com/questions/65969371/creating-a-function-to-fetch-europmc-literature-to-skip-paper-which-doest-retur/65969520#65969520 – PesKchan Jan 30 '21 at 15:51

You can access the tables from the first article with `tbls <- pmc_table(doc[[1]])`, or substitute 2 for 1 to get the tables from the second article. Then you can access each table from the given article as `tbls[[1]]` (e.g., for the first table). – Ben Jan 30 '21 at 15:53

Oh - glad you got the help you needed! That's great! – Ben Jan 30 '21 at 15:54
Yes, I did that yesterday, but I was looking for a way to automate it through a function. – PesKchan Jan 30 '21 at 15:54
As a biology grad I have to rack my brain going through tons of Stack answers; it's like assembling code from 10 places and making it work. – PesKchan Jan 30 '21 at 15:55

Please see edited answer (bottom) - this would allow for multiple articles, and storing all the tables as a list. – Ben Jan 30 '21 at 16:10
`Error in as_mapper(.f, ...) : object 'pmc_table' not found`. This comes up because I ran only the edited code that you added. How do I incorporate `pmc_table` into the edited code, since in the first part you created that object with a single query only? – PesKchan Jan 30 '21 at 16:57

`pmc_table` is a function in `tidypmc` - did you load that library? You need `library(tidypmc)` as before. – Ben Jan 30 '21 at 17:00

To double check, on my computer I restarted R, cleared my environment, loaded the 3 libraries (tidypmc, tidyverse, europepmc), ran the 4 lines of code, and it worked for me. – Ben Jan 30 '21 at 17:01

Yes, I restarted my R session and it works just fine. No need for a complicated loop at all. – PesKchan Jan 30 '21 at 17:08

`names(my_tables) <- pmcids` is how I'm adding labels to the list. I hope I'm doing it right; I checked some labels at random and they are fine. – PesKchan Jan 31 '21 at 06:03

That approach is totally fine and should work. Alternatively, if you want to, you can use `set_names` (see my edited answer). This would essentially take the vector of PMCIDs, name each element with the same PMCID, and pass it through to getting the document text, and eventually the tables. – Ben Jan 31 '21 at 15:08
Yes, that works absolutely fine. Now I'm trying to combine all those outputs into a data table in a Markdown document. Passing the list to Markdown seems difficult, so I'm trying to read the saved output into Markdown instead. – PesKchan Jan 31 '21 at 15:15
