Using R to analyse PubMed articles: creating a word cloud and associating terms with year of publication

Question

MOST RECENT EDIT:

I have successfully created the data frame I need from a literature search on PubMed, with pmid, year and abstract as columns. I then split this data frame into many separate ones by year, so I now have multiple data frames, each with 3 columns: pmid, year and abstract. In total there are 4000 rows across all data frames.

Now I need to run the tm package to clean up my abstract columns and remove words I don't need, punctuation, etc. But I don't know how to do this on a data frame. I understand how it works on a text file.

I want to output the frequencies of words appearing in the text, so that I can create a graph of words by year. I then want to create a word cloud using wordcloud2.

I am happy to use any other suggested packages.

Here is my code:

library(easyPubMed)
library(dplyr)
library(kableExtra)

# Query PubMed
qr1 <- get_pubmed_ids("platinum resistant AND cancer")

# How many records are there?
print(qr1$Count)

# Query pubmed and fetch many results
my_query <- 'platinum resistant AND cancer' 
my_query <- get_pubmed_ids(my_query)

# Fetch data, note retmax is 7000 as for some reason we need a value and a higher value returns errors
my_abstracts_xml <- fetch_pubmed_data(my_query, retstart = 0, retmax = 7000)  

# Store Pubmed Records as elements of a list
all_xml <- articles_to_list(my_abstracts_xml)

# Starting time: record
t.start <- Sys.time()

# Perform operation (use lapply here, no further parameters)
final_df <- do.call(rbind, lapply(all_xml, article_to_df, 
                                  max_chars = -1, getAuthors = FALSE))

# Final time: record
t.stop <- Sys.time()

# How long did it take?
print(t.stop - t.start)

# Show an excerpt of the results
final_df[,c("pmid", "year", "abstract")]  %>%
  head() %>% kable() %>% kable_styling(bootstrap_options = 'striped')

# Reduce columns to those required for the overall wordcloud
wordcloud_df <- final_df[,c('pmid','year','abstract')]

#split df by year for analysis by year
df2022 <- wordcloud_df[which(wordcloud_df$year == "2022"),]
df2021 <- wordcloud_df[which(wordcloud_df$year == "2021"),]
df2020 <- wordcloud_df[which(wordcloud_df$year == "2020"),]
df2019 <- wordcloud_df[which(wordcloud_df$year == "2019"),]
df2018 <- wordcloud_df[which(wordcloud_df$year == "2018"),]
df2017 <- wordcloud_df[which(wordcloud_df$year == "2017"),]
df2016 <- wordcloud_df[which(wordcloud_df$year == "2016"),]
df2015 <- wordcloud_df[which(wordcloud_df$year == "2015"),]
df2014 <- wordcloud_df[which(wordcloud_df$year == "2014"),]
df2013 <- wordcloud_df[which(wordcloud_df$year == "2013"),]
df2012 <- wordcloud_df[which(wordcloud_df$year == "2012"),]
df2011 <- wordcloud_df[which(wordcloud_df$year == "2011"),]
df2010 <- wordcloud_df[which(wordcloud_df$year == "2010"),]
df2009 <- wordcloud_df[which(wordcloud_df$year == "2009"),]
df2008 <- wordcloud_df[which(wordcloud_df$year == "2008"),]
df2007 <- wordcloud_df[which(wordcloud_df$year == "2007"),]
df2006 <- wordcloud_df[which(wordcloud_df$year == "2006"),]
df2005 <- wordcloud_df[which(wordcloud_df$year == "2005"),]
df2004 <- wordcloud_df[which(wordcloud_df$year == "2004"),]
df2003 <- wordcloud_df[which(wordcloud_df$year == "2003"),]
df2002 <- wordcloud_df[which(wordcloud_df$year == "2002"),]
df2001 <- wordcloud_df[which(wordcloud_df$year == "2001"),]
df2000 <- wordcloud_df[which(wordcloud_df$year == "2000"),]
df1999 <- wordcloud_df[which(wordcloud_df$year == "1999"),]
df1998 <- wordcloud_df[which(wordcloud_df$year == "1998"),]
df1997 <- wordcloud_df[which(wordcloud_df$year == "1997"),]
df1996 <- wordcloud_df[which(wordcloud_df$year == "1996"),]
df1995 <- wordcloud_df[which(wordcloud_df$year == "1995"),]
df1994 <- wordcloud_df[which(wordcloud_df$year == "1994"),]
df1993 <- wordcloud_df[which(wordcloud_df$year == "1993"),]
df1992 <- wordcloud_df[which(wordcloud_df$year == "1992"),]
df1991 <- wordcloud_df[which(wordcloud_df$year == "1991"),]
df1990 <- wordcloud_df[which(wordcloud_df$year == "1990"),]
df1989 <- wordcloud_df[which(wordcloud_df$year == "1989"),]
df1988 <- wordcloud_df[which(wordcloud_df$year == "1988"),]
df1987 <- wordcloud_df[which(wordcloud_df$year == "1987"),]
df1986 <- wordcloud_df[which(wordcloud_df$year == "1986"),]
df1985 <- wordcloud_df[which(wordcloud_df$year == "1985"),]
df1984 <- wordcloud_df[which(wordcloud_df$year == "1984"),]
df1983 <- wordcloud_df[which(wordcloud_df$year == "1983"),]
df1982 <- wordcloud_df[which(wordcloud_df$year == "1982"),]
df1981 <- wordcloud_df[which(wordcloud_df$year == "1981"),]
df1980 <- wordcloud_df[which(wordcloud_df$year == "1980"),]
df1979 <- wordcloud_df[which(wordcloud_df$year == "1979"),]
df1978 <- wordcloud_df[which(wordcloud_df$year == "1978"),]
df1977 <- wordcloud_df[which(wordcloud_df$year == "1977"),]
df1976 <- wordcloud_df[which(wordcloud_df$year == "1976"),]
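
The year-by-year subsetting above can also be written as a single call. The following is a minimal sketch in base R, assuming a data frame with the same three columns as `wordcloud_df`; the `toy_df` data and the name `dfs_by_year` are made up for illustration and are not part of the original code.

```r
# Hypothetical stand-in for wordcloud_df, with the same column names
toy_df <- data.frame(
  pmid     = c("1", "2", "3", "4"),
  year     = c("2021", "2021", "2022", "1976"),
  abstract = c("first abstract", "second abstract",
               "third abstract", "fourth abstract"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per distinct year
dfs_by_year <- split(toy_df, toy_df$year)

names(dfs_by_year)          # the years that occur in the data
dfs_by_year[["2021"]]       # plays the role of df2021
sapply(dfs_by_year, nrow)   # number of abstracts per year
```

Keeping the pieces in one list means any later step can be applied to every year with `lapply()` instead of being repeated for each `dfYYYY` object.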

ORIGINAL POST: I am very new to programming and R in general. As part of my project, I would like to create a wordcloud, which I have managed to test and get working (I still need to clean it properly). But now I want to do something different.

If I were to search for my terms on pubmed, I will get roughly 7000 articles. I'm able to download all abstracts to my computer, stick them in a txt file and then make my wordcloud (just about).

However now I want to correlate the terms I find with frequency of said terms over the years. This way I can see how research is directed/changing over the years. This is where I am stuck however.

Whilst I can get the abstracts, how do I somehow associate each abstract with a year then get a frequency per year?

I found the easyPubMed package but I don't think I'm able to do what I want with it. Any suggestions?

Thank you!

(I'm using wordcloud2 +tm currently)

I have tried to run easyPubMed but I'm not quite sure how to get it to do what I want. It may not even be the right package. I have tried to download directly from PubMed, but I cannot download both the abstract and the year as a separate file. There is an option to download an Excel file, but this will only contain the year, author, PubMed ID and a couple of other bits, not the abstract. Otherwise I probably could have used the Excel file?

Why don't you just get all the abstracts associated with their PMIDs, and then get the article years associated with their PMIDs, and use a join to link them by their PMID? — dcsuka, Oct 29 '22 at 21:41
Thank you! How would I do this (I'm still very new to it all!) Just a pointer and I can figure out the rest hopefully. And if it's a process that I have to do per article, a way to automate it for all 7000? — Aidi, Oct 29 '22 at 21:55
When you "download all abstracts" is the file title for each of them the PMID? Please elaborate about where the pmid is in relation to all these data objects that you can access. — dcsuka, Oct 29 '22 at 21:59
Okay so these are the ways I can download from pubmed:
1) download abstracts as a txt file and it will look something like this:

1. Bioorg Chem. 2019 Jul;88:102925. ..........
"big abstract text here".....
DOI: 10.1016/j.bioorg.2019.102925 PMID: 31003078 [Indexed for MEDLINE]
That is one entry, on the next line in the txt file will be the next entry and so on.
2) I can download a csv file and if I open in excel, it will have columns of PMID, title, author etc but no abstract.
option 1 has everything I need but no association? — Aidi, Oct 29 '22 at 22:07
sorry, I'm trying to figure out how to format the comment with line breaks — Aidi, Oct 29 '22 at 22:11

dcsuka · Accepted Answer · 2022-10-30T17:16:45.553


Here is a possible implementation using regular expressions to extract the PMID from the text, for later joining with the other csv file:

library(tidyverse)

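# Fill in the path to the abstracts text file saved from PubMed
txtpath <- "/Users/davidcsuka/Downloads/abstract-coaptite-set.txt"

# Split the file into one string per article at each "<newline><number>. " marker
# (\\d+ so that entries numbered 10 and above are also split),
# then label each string with the PMID found inside it
textdf <- read_file(txtpath) %>%
  str_split("(?=\\n\\d+\\. )") %>%
  unlist() %>%
  setNames(str_extract(., "(?<=PMID: )\\d+")) %>%
  enframe(name = "PMID")

# Fill in the path to the csv file saved from PubMed
csvpath <- "/Users/davidcsuka/Downloads/csv-coaptite-set.csv"

# Read the csv and attach each article's text by PMID
df <- read_csv(csvpath) %>%
  mutate(PMID = as.character(PMID)) %>%
  left_join(textdf, by = "PMID")

view(df)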

You could later run the text mining functions on the text column of the dataframe. Or you could just process the text files as a vector, then make it a dataframe later. Let me know if this works.

EDIT:

# Combine all abstracts from the same year into one string per year
dfnew <- df %>%
  group_by(year) %>%
  summarise(newtext = paste(abstract, collapse = " "))



textmine <- function(onetext) {
  #Write text mining function for one article
}

#verify function with:
textmine(dfnew$newtext[1])

#get all results with:

results <- lapply(dfnew$newtext, textmine)
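
One way the `textmine` placeholder could be filled in is sketched below in base R. This is an illustration under stated assumptions, not the answerer's implementation: the tiny `stop_words` vector and the `toy` data are made up, and a real analysis would likely use a fuller stop-word list such as the one shipped with tm.

```r
# Small illustrative stop-word list (an assumption, far from complete)
stop_words <- c("the", "a", "an", "of", "in", "and", "to", "is", "for", "with")

# Count word frequencies in one string of text
textmine <- function(onetext) {
  words <- tolower(onetext)
  words <- gsub("[[:punct:][:digit:]]+", " ", words)  # punctuation/digits -> space
  words <- unlist(strsplit(words, "\\s+"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  freq  <- sort(table(words), decreasing = TRUE)
  data.frame(word = as.character(names(freq)),
             freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Hypothetical data with the same year/abstract columns as in the question
toy <- data.frame(
  year     = c("2021", "2021", "2022"),
  abstract = c("Platinum resistance in ovarian cancer.",
               "Ovarian cancer and platinum.",
               "Bladder cancer: a review"),
  stringsAsFactors = FALSE
)

# Clump abstracts by year, then text mine each clump
text_by_year <- lapply(split(toy$abstract, toy$year), paste, collapse = " ")
results <- lapply(text_by_year, textmine)

# Stack into one long table of year / word / freq for plotting
freq_by_year <- do.call(rbind, lapply(names(results), function(y) {
  data.frame(year = rep(y, nrow(results[[y]])), results[[y]],
             stringsAsFactors = FALSE)
}))
freq_by_year
```

The long `freq_by_year` table has one row per word per year, which is the shape most plotting functions expect when graphing how a word's frequency changes over time.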

edited Oct 30 '22 at 17:16

answered Oct 29 '22 at 22:11

dcsuka


For this to work I would need to have a csv file with the PMID and a txt file with the abstract, right? And it will join them? In case it simplifies things, I've found a way to at least see my results within R now. I followed this tutorial (specifically demo 1): https://www.data-pulse.com/projects/Rlibs/vignettes/easyPubMed_02_advanced_tutorial.html Within the R viewer, I now have pmid, year and abstract. I can now loop this and get all 7000 results, right? But how do I go from here to using tm and wordcloud2? – Aidi Oct 29 '22 at 22:48
You can obtain that csv easily on PubMed's website. Just make a query, hit save, and choose csv. Just `lapply` or `map` the `tm` function to the vector of texts that my code creates, skip the setNames and enframe step to see just the vector. – dcsuka Oct 29 '22 at 22:51
Hello, I ran the code with the file locations changed and it seemed to work without issue, but it hasn't done exactly what I want. From what I understand, running this code produces a df extracted from the csv file and a textdf from the txt file. The df has 7536 lines with a neat table but without the abstracts. The textdf has just 19 lines and the second column is basically everything else, including the abstract. If it helps, on PubMed you can save as "pubmed", which is a txt file that contains all information for each article in one place. – Aidi Oct 30 '22 at 08:35
Okay, this is where I have got to using easyPubMed, and it's worked. I followed Demo1 on this website: data-pulse.com/projects/Rlibs/vignettes/… So now I have a data frame with the year and abstract in RStudio. My next steps are twofold. I need to text mine to find the frequency of words within the abstracts (to create a wordcloud later), but how do I do this from data frames? I also want to associate each word found from text mining the abstracts with the year of appearance. This is to graph the change in appearance of words over the years. How would I do this? – Aidi Oct 30 '22 at 09:37
I have edited with some more code. Just clump text from the same year together. Then text mine that. – dcsuka Oct 30 '22 at 17:17
Okay, I was having some problems with the code (due to my limited understanding of R, really, and time). I'm going to edit the original post with my code, where I have got to and where I want to go. Basically I am at the point now where I have successfully made a data frame containing 2 columns, year and abstract, and I have also managed to separate it into individual data frames. I don't actually know how to use the tm package to clean up text from data frames... I managed before from a text file but don't understand how to do it to a data frame column. – Aidi Oct 31 '22 at 16:56
Write a function which works for one "text", then lapply it to everything just as I show with my code. Make sure it works for `dfnew$newtext[1]`, which you say you already can do. – dcsuka Oct 31 '22 at 17:03
Thank you. I am unsure of how to write the text mine functions for a data frame. I tried the following on a data frame directly to test:`df1976_1 <-Corpus(DataframeSource(df1976), readerControl = list(language ="lat")) dtm <- TermDocumentMatrix(df1976_1) matrix <- as.matrix(dtm) df1976_1 <- tm_map(df1976_1, removePunctuation)` But it gives me this error: `'dtm <- TermDocumentMatrix(df1976_1) matrix <- as.matrix(dtm) df1976_1 <- tm_map(df1976_1, removePunctuation)'.` I understand the error being I don't have a doc id column in my data frame. – Aidi Oct 31 '22 at 18:34
I'd rather not add another column if there is another way. If I do need to do that, how do I add a new column into my data frames and reorder the columns? – Aidi Oct 31 '22 at 18:35
https://stackoverflow.com/questions/47406555/error-faced-while-using-tm-packages-vcorpus-in-r Try colnames(df1976) <- c("doc_id", "year", "text"), you don't have to split into a million dfs, just make a function that does one df apply to all, then `group_by` year to analyze. I recommend reading R for data science by Hadley Wickham. See `unnest` function for later reference too, to widen the matrix of results. – dcsuka Oct 31 '22 at 19:10
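
The renaming suggested in the comment above, followed by a typical tm clean-up, might look like the sketch below. It assumes the tm package is installed and uses a made-up `toy_df` with the same pmid/year/abstract columns as the question; the choice and order of clean-up steps is one common arrangement, not something prescribed in this thread.

```r
library(tm)

# Hypothetical data with the same columns as the question's data frames
toy_df <- data.frame(
  pmid     = c("101", "102", "103"),
  year     = c("2021", "2021", "2022"),
  abstract = c("Platinum resistance in ovarian cancer.",
               "Ovarian cancer and platinum.",
               "Bladder cancer: a review."),
  stringsAsFactors = FALSE
)

# DataframeSource() expects columns named doc_id and text;
# any other columns (here: year) are kept as document metadata
corpus_df <- data.frame(doc_id = toy_df$pmid,
                        text   = toy_df$abstract,
                        year   = toy_df$year,
                        stringsAsFactors = FALSE)

corpus <- VCorpus(DataframeSource(corpus_df))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Term-document matrix -> overall word frequencies
tdm     <- TermDocumentMatrix(corpus)
freq    <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq_df <- data.frame(word = names(freq), freq = as.integer(freq),
                      stringsAsFactors = FALSE)
freq_df
```

A two-column data frame of words and frequencies like `freq_df` is the kind of input `wordcloud2::wordcloud2()` accepts, so it could be passed there for the word cloud. For 7000 abstracts, `as.matrix(tdm)` may become large, in which case working on one year at a time is an option.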
That worked. I'll give the code you have written a go with the functions and see how I get on. Thank you for the book suggestion too. When I embarked on this literature review, this wasn't part of the initial plan and it's quite late into the project, hence the rush! – Aidi Oct 31 '22 at 20:31
I've made some good progress and have a draft wordcloud but now I have another question! When I clean up text to obtain frequency of words, is there a way to obtain frequency of words appearing in a certain order? For example, there is some text that says xxxxxxxxx bladder cancer xxxxx. What I really want to find is the freq. of "bladder cancer". At the moment I have individual words which is useful but also not at the same time. I'm really trying to find out how often "x cancer" is appearing in the text where x is something I do not know but want to know (bladder was an example). – Aidi Nov 01 '22 at 19:45
At present I have all my abstracts in a single column (of 3) in a data frame as before. – Aidi Nov 01 '22 at 19:45
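
For the "x cancer" question above, one possible starting point is a regular expression that captures the word immediately before "cancer". This is only a base R sketch on made-up abstracts; it is not from the answerer, and it ignores plurals and multi-word names.

```r
# Hypothetical abstracts for illustration
abstracts <- c(
  "Bladder cancer is common. Advanced bladder cancer and ovarian cancer.",
  "Ovarian cancer resistant to platinum.",
  "Treatment of bladder cancer."
)

clean <- tolower(abstracts)

# Match a single word followed by one space and the whole word "cancer"
pattern <- "\\b[a-z]+ cancer\\b"
matches <- regmatches(clean, gregexpr(pattern, clean, perl = TRUE))

# Frequency of each "<word> cancer" phrase across all abstracts
phrase_counts <- sort(table(unlist(matches)), decreasing = TRUE)
phrase_counts
```

Because punctuation is left in place, a phrase is only counted when the two words are directly adjacent in the same sentence.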
This should be asked as a separate question, feel free to link it if you still need help. – dcsuka Nov 06 '22 at 01:23
I just wanted to say thank you for all your help dcsuka. I've made some good progress now – Aidi Nov 07 '22 at 18:07
Glad to hear it Aidi, if you have any more pubmed questions feel free to link the new question here. – dcsuka Nov 07 '22 at 18:08
