This is a re-edited post - the original post wasn't clear enough, but I hope this one is sufficiently improved
I have approximately 400 .csv files, all of which have the same number of columns (7 in total). One file is generated every day (hence they are separate files, and I would prefer to keep them that way for the moment). Due to an issue whilst getting the data, a bunch of these files (around 30 or so consecutive files) are missing data in one of the columns: Programme_Duration
. There is a very good chance that this data exists in one or more of the other "complete" files (I won't go into detail as to how/why, but there is a lot of repetition in the data).
Here are examples of the files in question (I'm not quite sure how else to share the data because some of the observations are very lengthy strings. Hopefully these pictures will suffice):
An example of an "incomplete" file, with observations missing for Programme_Duration
.
Before I go into the approach I'm working on, it's worth pointing out that the Programme_Synopsis_url
column will match across both the complete and incomplete files. As such, this is probably the key to solving this. I.e. to create a script that:
- Creates a data frame that combines all 370 "complete" csv files (let's call it df_complete).
- Reads in the first "incomplete" file (let's call it
incomplete_file
). - Identifies any matching rows between
df_complete
andincomplete_file
for theProgramme_Synopsis_url
column - If there is a match, then copy the contents of the relevant row from
Programme_Duration
fromdf_complete
into the corresponding row forProgramme_Duration
of theincomplete_file
. - Write out.
- Repeat, i.e. loop through all 30 "incomplete" files.
Some of this I can do (steps 1, 2 5 and 6!) but the important middle part has me stumped. Hopefully the post if clearer this time around. Any help with this would be much appreciated!
UPDATE: SOLVED
In case anyone comes across this post and is in a similar situation, I thought I would update this with my solution. Full disclosure, it's not a neat or tidy solution. I have little to no experience of programming, and everything I've done below is cobbled together through a combination of trial and error, and from different sources online. I'm particularly grateful to @kstew whose answer helped me figure this out.
One other disclaimer before I share the code that worked for me: this was a very unique and unusual problem. I don't think I did a particularly good job of explaining it in my original post (largely because of my ignorance of this field). Indeed, there were several other important factors/challenges I faced whilst doing this - for example, I had to retain the original order of the rows in the "incomplete" files (I solved this by simply adding a new column called index
, so that the original order of the rows could be restored).
Again, this code will probably look like a complete mess to more experienced programmers, but it worked for me. Having said that, feel free to edit / tidy it up!
Here's the code, with explanations for each step:
### First, create data.frame from "complete" csv files ###
folder_complete <-"insert path here"
df_list_complete <- list.files(path=folder_complete, pattern="*.csv", full.names = TRUE)
df_complete = ldply(df_list_complete, read_csv)
### Then, read in and edit "incomplete" files one at a time using for loop ###
### Note "incomplete" files are in a different director - this was set during the session ###
filenames <- dir(pattern = "*.csv")
for (i in 1:length(filenames)) {
tmp <- read.csv(filenames[i], stringsAsFactors = FALSE)
### Merge / Identify matches between "complete" data.frame and "incomplete"
file ###
### using "Programme Synopsis" as the unique column ###
tmp_new <- merge(tmp, df_complete, by = "Programme_Synopsis")
### Delete any rows with NAs in specific columns - ###
### I did this because the previous step matched empty rows for these columns, and I didn't want these ###
tmp_new <- distinct(tmp_new,Programme_Synopsis_url.x, .keep_all = TRUE)
tmp_new <- distinct(tmp_new,Programme_Duration.y, .keep_all = TRUE)
### Delete Duplicate columns - merging created several duplicate columns (.y, .x) ###
### I only wanted to add the matching "Programme Duration" column from the "complete" data.frame to the "incomplete" file ###
### but wasn't sure how to do this. ###
### Instead, I had to retrospectively remove the duplicate columns ###
tmp_new <- tmp_new[ -c(2:7) ]
### Rename columns ###
tmp_new2 <- rename(tmp_new, c("Programme_Synopsis_url.y" =
"Programme_Synopsis_url",
"Programme_Duration.y" = "Programme_Duration",
"Programme_Category.y" = "Programme_Category",
"Programme_Availability.y" = "Programme_Availability",
"Programme_Genre.y" = "Programme_Genre",
"Programme_Title.y" = "Programme_Title"))
### Merge (again!) using plyr Join function ###
df <- join(tmp_new2, tmp, by = "Programme_Synopsis_url", type = "full")
### Delete any without an index ###
### (i.e. those that don't belong in this dataframe) ###
df <- df[!is.na(df$index), ]
### Re-order by original index ###
df <- df[order(df$index), ]
### Remove duplicated index columns ###
df$index.x <- NULL
df$index.y <- NULL
### Write out the new file ###
write.csv(df, filenames[[i]], row.names = FALSE)
Anyway, hope this update helps someone else. Equally, if you see this and can think of a more elegant solution, I'd love to hear it.