Sorry, I have looked for solutions but couldn't find what I needed. I am quite new to R and have only used MATLAB before (hence I am still trying to work out how to avoid loops).

I have a df with academic papers in it (one row per paper).

Main df

Fields                              Date       Title
Biology; Neuroscience               2016       How do we know when XXX
Music; Engineering; Art             2011       Can we get the XXX
Biotechnology; Biology & Chemistry  2007       When will we find XXX
History; Biology                    2006       Where does the XXXX

In one column ('Fields') there is a list of subject names, with multiple fields separated by a semicolon. I want to find all rows (papers) that have an exact match to a specific field name (e.g., 'Biology'), then make a new df with all those rows (papers). Importantly, however, I do not want to pick up fields that only partially match (e.g., 'Biology & Chemistry').

New df - just for those rows

Fields                              Date       Title
Biology; Neuroscience               2016       How do we know when XXX
History; Biology                    2006       Where does the XXXX

That is, it should not also select the row 'Biotechnology; Biology & Chemistry' (2007, 'When will we find XXX'), even though it has the word 'Biology' in it.

My first thought was to get each field name in its own column using splitstring, then loop through each column using which to find the exact matches for the name. Because there are up to 200 columns (field names) this takes ages! It's taking up to an hour to find and pull all the rows. I would obviously like something faster.

I know in R you can avoid loops by using the apply family etc., but I can't think how to use that here.
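For illustration, here is a minimal sketch of one loop-free way to get exact matches, assuming the fields are always separated by ';' and may carry surrounding white space. The data frame `papers` is a toy copy of the example table above, and `field_list`, `has_field`, `targets` and `by_field` are hypothetical names, not part of the original code.

```r
# Toy copy of the example table (assumed layout: one string of fields per paper)
papers <- data.frame(
  Fields = c("Biology; Neuroscience",
             "Music; Engineering; Art",
             "Biotechnology; Biology & Chemistry",
             "History; Biology"),
  Date   = c(2016, 2011, 2007, 2006),
  Title  = c("How do we know when XXX", "Can we get the XXX",
             "When will we find XXX", "Where does the XXXX"),
  stringsAsFactors = FALSE
)

# Split each row's Fields string on ';' once, then trim white space
field_list <- lapply(strsplit(papers$Fields, ";", fixed = TRUE), trimws)

# TRUE for rows where one of the split fields equals the target exactly
has_field <- function(target) {
  vapply(field_list, function(f) target %in% f, logical(1))
}

# Rows for a single field
biology_papers <- papers[has_field("Biology"), ]

# One data frame per field name, without writing an explicit for loop
targets  <- c("Biology", "Art")
by_field <- lapply(setNames(targets, targets),
                   function(tg) papers[has_field(tg), ])
```

Because the split happens once and each comparison is a whole-string match via `%in%`, 'Biology & Chemistry' is never mistaken for 'Biology', and there is no need to spread the fields over up to 200 columns first.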

This is what it looks like when I split the field names into separate columns

Field1        Field2                     Date       Title
Biology       Neuroscience               2016       How do we know when XXX

This is my code so far (note: there is a white space in front of the names once I split them up)

# Get list of columns to cycle through (they all start with 'sA')
names <- data[, grep("^sA", colnames(data))]
collist <- colnames(names)
names[collist] <- sapply(names[collist], as.character)
collist <- collist[-1]  # drop the first column; it is handled separately below

# Loop to get new df from matching rows
for (l in seq_along(namesUniq$Names)) {
  namecurr <- namesUniq$Names[l]
  namecurrSP <- paste0(" ", namecurr)  # later columns have a leading space

  # Get data for that field: exact match in the first split column
  dfall <- data[which(data$sA1 == namecurr), ]

  # Append rows with an exact match in any of the remaining split columns
  for (d in seq_along(collist)) {
    dcol <- collist[d]
    dfall <- rbind(dfall, data[which(data[, dcol] == namecurrSP), ])
  }
}

Something that runs quickly would be really useful. Thank you for any help!

grepl does not work - it pulls other partial match strings (like 'Biology & Chemistry' when I want 'Biology' only)

    dfall <- subset(data, grepl(namecurr, Field, fixed = TRUE))
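One possible way to keep using grepl is to anchor the field name between separators, so that it must be a complete ';'-delimited entry. The sketch below is only an illustration on toy data: `fields`, `target`, `pattern` and `exact` are hypothetical names, and it assumes the field name does not itself contain the sequence `\E`, since `\Q...\E` (with `perl = TRUE`) is used to treat the name as literal text.

```r
# Toy version of the Fields column from the example table
fields <- c("Biology; Neuroscience",
            "Music; Engineering; Art",
            "Biotechnology; Biology & Chemistry",
            "History; Biology")

target <- "Biology"

# The name must sit between the start of the string or a ';' on the left
# and a ';' or the end of the string on the right (white space allowed).
# \Q...\E quotes the name so characters such as '&' or '.' stay literal.
pattern <- paste0("(^|;)\\s*\\Q", target, "\\E\\s*(;|$)")

exact <- grepl(pattern, fields, perl = TRUE)
exact
# [1]  TRUE FALSE FALSE  TRUE
```

With a real data frame this logical vector could be used directly for row selection, for example `data[exact, ]`.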

For some reason, which does not work when I do it this way (rows works, rows2 does not - it selects rows outside the bounds of my df)

    rows <- which(data[, collist[1]] == namecurr)
    rows2 <- which(data[, collist[-1]] == namecurrSP)
    dfall <- rbind(data[rows, ], data[rows2, ])
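A likely explanation, sketched here on toy data: comparing several data frame columns at once returns a logical matrix, and `which()` on a matrix returns linear (column-major) positions rather than row numbers, so values larger than `nrow()` can appear. The names `split_cols`, `hits`, `linear_idx`, `row_idx` and `any_hit` are hypothetical, and the column layout is assumed from the description above.

```r
# Toy split columns (later columns keep their leading space, NA where empty)
split_cols <- data.frame(
  sA1 = c("Biology", "Music", "Biotechnology", "History"),
  sA2 = c(" Neuroscience", " Engineering", " Biology & Chemistry", " Biology"),
  sA3 = c(NA, " Art", NA, NA),
  stringsAsFactors = FALSE
)

# Comparing two columns at once gives a 4 x 2 logical matrix
hits <- split_cols[, c("sA2", "sA3")] == " Art"

# Linear position in the matrix: row 2 of the 2nd column is element 4 + 2
linear_idx <- which(hits)

# arr.ind = TRUE returns row/column positions instead of linear ones
row_idx <- unique(which(hits, arr.ind = TRUE)[, "row"])

# Alternative: flag rows with at least one match, ignoring NA cells
any_hit <- rowSums(hits, na.rm = TRUE) > 0

matched <- split_cols[any_hit, ]
```

Here `linear_idx` is 6 even though the data frame has only 4 rows, whereas `row_idx` and `any_hit` both point at row 2.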
  • Try `subset(df, grepl('Dempsey-Jones, H.', Names, fixed = TRUE))` – Ronak Shah Jun 04 '20 at 08:12
  • Hi there Ronak, thanks for your suggestion. Unfortunately, I had tried grepl before and it doesn't give the desired result. I also have to run this with fields of research, and for a list such as `Biology; Chemistry; Biology & Neuroscience`, `grepl('Biology', Fields, fixed = TRUE)` picks up both Biology and Biology & Neuroscience, where I only want Biology. So what I had to do was splitstring the different fields into different columns and search for an exact match there. – Harriet Dempsey-Jones Jun 04 '20 at 12:07
  • For some reason `rows` works, but `rows2` does not: `rows2` indexes rows that are outside the bounds of the data frame. `rows <- which(data[, collist[1]] == namecurr)` and `rows2 <- which(data[, collist[-1]] == namecurrSP)` – Harriet Dempsey-Jones Jun 04 '20 at 12:11
  • If I copy the data you shared above into R and use the code I provided, it gives me the expected output that you have shown. If it doesn't work for you, you should provide a reproducible example that actually represents your data so that we can verify the results on our end. Read how to give a [reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Jun 04 '20 at 12:14
