1

I'm new to R (as well as stackoverflow, hence the bullets just represent new lines) and am assigned to work on a project in which I need to clean MEDLINE data into a neat dataframe. An example of what the raw .txt file looks like is:

 PMID- 28152974 
 OWN - NLM 
 IS  - 1471-230X (Electronic) 
 IS  - 1471-230X (Linking) 
 PMID- 28098115 
 OWN - NLM 
 IP  - 1 
 VI  - 28 

etc.

Each new observation starts with PMID, and not all of the variables are included in each observation, and some cells with the same column name in the same observation need to be merged (ie IS). The final data frame should look like:

 PMID      OWN  IS                                          VI
 28152974  NLM  1471-230X (Electronic) 1471-230X (Linking)  N/A 
 28098115  NLM  N/A                                         28 

etc.

Currently I've manipulated my data in many ways. The first is in the format of the raw data file, but in two columns, without the "-". ex:

 PMID 28152974
 OWN  NLM
 IS   1471-230X (Electronic) 
 IS   1471-230X (Linking) 
 PMID 28098115 
 OWN  NLM 
 IP   1 
 VI   28 

etc.

The second is all of the observations all in just one row with thousands of columns for each variable. ex:

 PMID      OWN   IS                      IS                   PMID      OWN
 28152974  NLM   1471-230X (Electronic)  1471-230X (Linking)  28098115  NLM

etc.

The third is similar to the second but instead of thousands of columns it only has the distinct column types from the first PMID values. ex:

 PMID               OWN      IS 
 28152974 28098115  NLM NLM  1471-230X (Electronic) 1471-230X (Linking)

etc.

Please help. I don't know how to splice my data and don't know which manipulation I should work with.

sweetmusicality
  • 937
  • 1
  • 10
  • 27
  • Concerning your third step, I suppose you finally want one row for each distinct `PMID`? Which data fields do you want to keep and which can be ignored? – Uwe Feb 16 '17 at 07:45

3 Answers3

1

Reproducible data:

d <- c("PMID- 28152974", "OWN - NLM", "IS  - 1471-230X (Electronic)", 
       "IS  - 1471-230X (Linking)", "PMID- 28098115", "OWN - NLM", "IP  - 1", 
       "VI  - 28")

Input from file:

d <- readLines('/path/to/file')

One idea:

# split into records
i <- grepl("^PMID", d)
i <- cumsum(i)
d <- split(d, i)

# split into key-value pairs
d <- lapply(d, strsplit, "\\ {0,2}-\\ ")
d <- lapply(d, function (x) setNames(sapply(x, '[[', 2), sapply(x, '[[', 1)))

# merge IS variables
d <- lapply(d, function (x) {
  i <- names(x) == "IS"
  if (any(i))
     x <- c(x[!i], IS = paste(x[i], collapse = " "))
  return(x)
})

# merge records to data.frame
library(data.table)
d <- lapply(d, as.list)
d <- lapply(d, as.data.table)
d <- rbindlist(d, fill = T)
d <- as.data.frame(d)
setempler
  • 1,681
  • 12
  • 20
0

It is not uncommon that data of different kind are mixed up in one or two columns of a data file. As long as the different kind of data can be identified in some way, e.g., by a regular expression, the contents of the rows can be moved to different columns.

The following solution uses read_fwf() from package readr to read the fixed width data from the text file (here simulated by reading from a character string). dcast() from the data.table package is used to reshape from long to wide format which yields a data.frame with one record per row:

Read data

library(data.table)
# read data 
dt <- readr::read_fwf(
  " PMID- 28152974 
 OWN - NLM 
 IS  - 1471-230X (Electronic) 
 IS  - 1471-230X (Linking) 
 PMID- 28098115 
 OWN - NLM 
 IP  - 1 
 VI  - 28 ", 
  readr::fwf_positions(c(2, 8), c(5, Inf), c("variable", "value")),
  col_types = "cc")
# coerce tibble to data.table
setDT(dt)

Reshape from long to wide format

# create new column PMID with the record id
dt[variable == "PMID", PMID := value]
# fill missing values in subsequent rows to mark all rows belonging to one record
dt[, PMID := zoo::na.locf(PMID)]
dt
#   variable                  value     PMID
#1:     PMID               28152974 28152974
#2:      OWN                    NLM 28152974
#3:       IS 1471-230X (Electronic) 28152974
#4:       IS    1471-230X (Linking) 28152974
#5:     PMID               28098115 28098115
#6:      OWN                    NLM 28098115
#7:       IP                      1 28098115
#8:       VI                     28 28098115

# reshape from wide to long, thereby collapsing strings if necessary
dcast(dt[variable != "PMID"], PMID ~ ..., fun = paste, collapse = " ")
#       PMID IP                                         IS OWN VI
#1: 28098115  1                                            NLM 28
#2: 28152974    1471-230X (Electronic) 1471-230X (Linking) NLM   

Note that this approach is quite flexible as it collapses all duplicate data fields if they appear in the data no matter how they are being named, not just the IS column.

Uwe
  • 41,420
  • 11
  • 90
  • 134
  • I copied and pasted exactly your code and I did not get the desired results...instead I got # PMID IP IS OWN PMI VI #1: 28152974 1 1471-230X (Electronic) 1471-230X (Linking) NLM NLM 28098115 28 exactly like that...one row only, and duplicated too (and PMI not PMID...) Also, to clarify, the .txt contains thousands of rows of data, and thus hundreds of PMIDs which come at random intervals – sweetmusicality Feb 16 '17 at 20:21
  • Pease, can you verify in your txt file that the data fields are vertically aligned, i.e., that the hyphen `-` appears at the same position in each row and that `PMID` and the other field names start always at position 2 with one leading blank at position 1? Your result looks like as if the second PMID starts at position 3 (with 2 leading blanks). If the final `D` of `PMID` is located at the position of `-` it will be cut off. – Uwe Feb 16 '17 at 21:45
  • And, yes the proposed solution is capable to handle large files with PMIDs at random intervals. And it should be sufficiently fast due to `data.table`. – Uwe Feb 16 '17 at 21:50
  • @sweetmusicality Hope you have made some progress. Do you have an update on the issue with reading the file in fixed width format? – Uwe Feb 21 '17 at 14:48
  • Hi @Uwe Block. I'm not sure why previously your solution did not work. I used @setempler's solution while I was working with a subset for the past few months, but now I need to read in 5 million observations, and that code crashes. I am trying yours and after I try using `dcast`, I am left with this error: `Using 'PMID' as value column. Use 'value.var' to override Error: cannot allocate vector of size 674.5 Gb` Sorry for the late response, but please advise. – sweetmusicality Jul 05 '17 at 19:46
  • It seems that the `value` column is missing from your `data.table` so `PMID` is picked instead. What is the output of `str()`? – Uwe Jul 07 '17 at 06:25
0

Thanks to G. Grothendieck, I learned about the read.dcf() function from base R which simplifies this task dramatically. Only some minor adjustments are required.

# use connection to avoid warnings
con <- file("test.dat")
# read file row-wise and adjust for dcf format
dat <- sub("PMID", "\nPMID", sub("- ", ": ", trimws(readLines(con))))
# close connection to avoid warnings
close(con)

# re-read from variable using dcf format, collapse multiple entries
result <- read.dcf(textConnection(dat), all = TRUE)
result
      PMID OWN                                         IS   IP   VI  
1 28152974  NLM 1471-230X (Electronic), 1471-230X (Linking) <NA> <NA>
2 28098115  NLM                                          NA    1   28
Uwe
  • 41,420
  • 11
  • 90
  • 134