
This is a re-edited post - the original post wasn't clear enough, but I hope this one is sufficiently improved

I have approximately 400 .csv files, all of which have the same number of columns (7 in total). One file is generated every day (hence they are separate files, and I would prefer to keep them that way for the moment). Due to an issue whilst getting the data, a bunch of these files (around 30 or so consecutive files) are missing data in one of the columns: Programme_Duration. There is a very good chance that this data exists in one or more of the other "complete" files (I won't go into detail as to how/why, but there is a lot of repetition in the data).

Here are examples of the files in question (I'm not quite sure how else to share the data because some of the observations are very lengthy strings. Hopefully these pictures will suffice):

An example of a "complete" file, with observations for Programme_Duration (note: not every row has an observation).

An example of an "incomplete" file, with observations missing for Programme_Duration.

Before I go into the approach I'm working on, it's worth pointing out that the Programme_Synopsis_url column will match across both the complete and incomplete files. As such, this column is probably the key to solving the problem, i.e. by creating a script that:

  1. Creates a data frame that combines all 370 "complete" csv files (let's call it df_complete).
  2. Reads in the first "incomplete" file (let's call it incomplete_file).
  3. Identifies any matching rows between df_complete and incomplete_file in the Programme_Synopsis_url column.
  4. If there is a match, then copy the contents of the relevant row from Programme_Duration from df_complete into the corresponding row for Programme_Duration of the incomplete_file.
  5. Write out.
  6. Repeat, i.e. loop through all 30 "incomplete" files.

Some of this I can do (steps 1, 2, 5 and 6!) but the important middle part has me stumped. Hopefully the post is clearer this time around. Any help with this would be much appreciated!
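For what it's worth, steps 3 and 4 can be sketched in base R with match(), assuming both data frames share the Programme_Synopsis_url and Programme_Duration columns (the toy data below stands in for the real files):

```r
# Toy stand-ins for the real data frames
df_complete <- data.frame(
  Programme_Synopsis_url = c("url_a", "url_b", "url_c"),
  Programme_Duration     = c("30 mins", "60 mins", "45 mins"),
  stringsAsFactors = FALSE
)
incomplete_file <- data.frame(
  Programme_Synopsis_url = c("url_b", "url_x", "url_a"),
  Programme_Duration     = NA_character_,
  stringsAsFactors = FALSE
)

# Step 3: for each incomplete row, find the first matching complete row
idx <- match(incomplete_file$Programme_Synopsis_url,
             df_complete$Programme_Synopsis_url)

# Step 4: copy Programme_Duration across where a match exists
fill <- !is.na(idx) & is.na(incomplete_file$Programme_Duration)
incomplete_file$Programme_Duration[fill] <- df_complete$Programme_Duration[idx[fill]]

incomplete_file$Programme_Duration
# "60 mins" NA "30 mins" - url_x has no match, so it stays NA
```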

UPDATE: SOLVED

In case anyone comes across this post and is in a similar situation, I thought I would update this with my solution. Full disclosure, it's not a neat or tidy solution. I have little to no experience of programming, and everything I've done below is cobbled together through a combination of trial and error, and from different sources online. I'm particularly grateful to @kstew whose answer helped me figure this out.

One other disclaimer before I share the code that worked for me: this was a very unique and unusual problem. I don't think I did a particularly good job of explaining it in my original post (largely because of my ignorance of this field). Indeed, there were several other important factors/challenges I faced whilst doing this - for example, I had to retain the original order of the rows in the "incomplete" files (I solved this by simply adding a new column called index, so that the original order of the rows could be restored).
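That index column is just a one-liner, added to each "incomplete" file before any merging (sketch; the toy tmp here stands in for a file just read in):

```r
# toy stand-in for an "incomplete" file that has just been read in
tmp <- data.frame(Programme_Title = c("A", "B", "C"), stringsAsFactors = FALSE)

# record the original row order before merging reshuffles it
tmp$index <- seq_len(nrow(tmp))

tmp$index
# 1 2 3
```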

Again, this code will probably look like a complete mess to more experienced programmers, but it worked for me. Having said that, feel free to edit / tidy it up!

Here's the code, with explanations for each step:


### First, create data.frame from "complete" csv files ###
library(plyr)   # load plyr before dplyr so dplyr's functions take precedence
library(dplyr)
library(readr)

folder_complete <- "insert path here"
df_list_complete <- list.files(path = folder_complete, pattern = "*.csv", full.names = TRUE)
df_complete <- ldply(df_list_complete, read_csv)

### Then, read in and edit "incomplete" files one at a time using a for loop ###
### Note: the "incomplete" files are in a different directory - the working directory was set during the session ###
filenames <- dir(pattern = "*.csv")
for (i in 1:length(filenames)) {
    tmp <- read.csv(filenames[i], stringsAsFactors = FALSE)
    ### Merge / identify matches between the "complete" data.frame and the "incomplete" file, ###
    ### using "Programme_Synopsis" as the unique column ###
    tmp_new <- merge(tmp, df_complete, by = "Programme_Synopsis")
    ### Delete any rows with NAs in specific columns - ###
    ### the previous step matched empty rows for these columns, and I didn't want those ###
    tmp_new <- distinct(tmp_new, Programme_Synopsis_url.x, .keep_all = TRUE)
    tmp_new <- distinct(tmp_new, Programme_Duration.y, .keep_all = TRUE)
    ### Delete duplicate columns - merging created several duplicate columns (.x, .y) ###
    ### I only wanted to add the matching "Programme_Duration" column from the "complete" ###
    ### data.frame to the "incomplete" file, but wasn't sure how to do this, ###
    ### so I had to retrospectively remove the duplicate columns ###
    tmp_new <- tmp_new[ -c(2:7) ]
    ### Rename columns (plyr:: prefix avoids the dplyr rename() masking plyr's) ###
    tmp_new2 <- plyr::rename(tmp_new, c("Programme_Synopsis_url.y" = "Programme_Synopsis_url",
                                        "Programme_Duration.y" = "Programme_Duration",
                                        "Programme_Category.y" = "Programme_Category",
                                        "Programme_Availability.y" = "Programme_Availability",
                                        "Programme_Genre.y" = "Programme_Genre",
                                        "Programme_Title.y" = "Programme_Title"))
    ### Merge (again!) using plyr's join function ###
    df <- plyr::join(tmp_new2, tmp, by = "Programme_Synopsis_url", type = "full")
    ### Delete any rows without an index ###
    ### (i.e. those that don't belong in this data frame) ###
    df <- df[!is.na(df$index), ]
    ### Re-order by the original index ###
    df <- df[order(df$index), ]
    ### Remove duplicated index columns ###
    df$index.x <- NULL
    df$index.y <- NULL
    ### Write out the new file ###
    write.csv(df, filenames[i], row.names = FALSE)
}

Anyway, hope this update helps someone else. Equally, if you see this and can think of a more elegant solution, I'd love to hear it.
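On the "more elegant solution" front, the whole matching-and-filling step can also be sketched with dplyr's left_join() and coalesce() (untested against the real files; the toy data below stands in for them, with column names taken from the post):

```r
library(dplyr)

# one-row-per-url lookup built from the "complete" data
lookup <- data.frame(
  Programme_Synopsis_url = c("url_a", "url_b"),
  Programme_Duration     = c("30 mins", "60 mins"),
  stringsAsFactors = FALSE
)

# an "incomplete" file with missing durations
tmp <- data.frame(
  Programme_Synopsis_url = c("url_b", "url_z", "url_a"),
  Programme_Duration     = NA_character_,
  stringsAsFactors = FALSE
)

# join the lookup on, keep the original value where present,
# fall back to the looked-up one where it is NA
patched <- tmp %>%
  left_join(lookup, by = "Programme_Synopsis_url", suffix = c("", ".lookup")) %>%
  mutate(Programme_Duration = coalesce(Programme_Duration, Programme_Duration.lookup)) %>%
  select(-Programme_Duration.lookup)

patched$Programme_Duration
# "60 mins" NA "30 mins" - url_z has no match, so it stays NA
```

This preserves the original row order of tmp, so the index column trick isn't needed.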

Japes
  • Your best bet would be to load each data frame into a list element, then do a join on all of the data frames. [This post](https://stackoverflow.com/a/34393416/11810235) explains how to merge a list of data frames. – rpolicastro Aug 23 '19 at 17:22

2 Answers


Writing as an answer because it would be too long for a comment.

I prefer using data.table when I have large datasets or have to read a lot of files; it is fast and memory-efficient. The code below reads all the 'complete' files into a list and then combines the list into one big data.table:

Code

library(data.table)

# assuming you have all csv files in the same location
basedir <- choose.dir()
fnames <- dir(path = basedir, pattern = '.*csv', all.files = T, full.names = T, recursive = F)

# names for the columns you want to read, using alphabets as an example 
column_names <- LETTERS[1:7]

big_list <- lapply(fnames, function(fname){
  dat <- fread(file = fname, select = 1:7, col.names = column_names)

  # test for an empty column, say, column B
  if( dat[!is.na(B), .N] < nrow(dat)){
    dat$type <- 'incomplete'
  }else{
    dat$type <- 'complete'
  }
  # return the data.table (not just the assigned string)
  dat
})

# combine them all into one data.table
big_data <- rbindlist(l = big_list, use.names = T, fill = T)

# set column B as the key
setkey(big_data, 'B')

complete <- big_data[type == 'complete']
incomplete <- big_data[type == 'incomplete']

Alternatively you could simply not read the files that are 'incomplete'.

For filling in the missing data there are many possible approaches, and I'm not sure about the exact logic you want to use. For instance, you could merge only the incomplete part with a different complete dataset.

A data.table approach based on keys is described in Fill missing values from another dataframe with the same columns (one-liner). Since I don't know your matching logic, I'm making some up here:

# sample for complete
complete <- as.data.table(mtcars)

# sample for incomplete
incomplete <- as.data.table(mtcars[1:20, ])

# set some values to NA - examples of missing data
# sample() picks 5 distinct row indices
incomplete[sample(.N, 5), mpg := NA]

Verify:

> incomplete[is.na(mpg), .N]
[1] 5

Merge based on wt and qsec:

# I used the on argument with two variables,
# because neither wt nor qsec alone is unique for each observation
setDT(incomplete)[complete, mpg := i.mpg, on = .(wt, qsec)]

result:

> incomplete[is.na(mpg), .N]
[1] 0

# Verifying that we have the right values filled in
> identical(complete[1:20, ], incomplete)
[1] TRUE

For writing the results out, you can use fwrite(complete, 'complete.csv'). You may omit the type column if you want.

Gautam

I would take a different approach from Gautam. Since your CSV files have the same number of columns, you can read in all files at once, use their filename (assuming that this is unique by date) as an ID of sorts, and unnest them into one large data frame. Then, you can filter out the incomplete cases by your desired columns (in this case duration), and replace the missing values based on data present in the complete cases and the matching column (i.e., URL).

# packages used: dplyr, purrr, readr and tidyr (all part of the tidyverse)
library(tidyverse)

# list CSV files in your directory
l <- list.files('./','.*csv',full.names = F)

# read in files, all cols as character, and map to one df
df <- data.frame(filename=l, stringsAsFactors = F) %>% 
  mutate(cont = map(filename,
                    ~ read_csv(file.path(.),col_types = cols(.default = 'c')))) %>% 
  unnest(.)

# find cases with missing duration
incomp <- df %>% mutate(duration=as.numeric(duration)) %>% 
  filter(is.na(duration)) %>% rename(duration.old=duration)

# find cases with present duration
comp <- df %>% mutate(duration=as.numeric(duration)) %>% 
  filter(!is.na(duration))

# replace missing duration based on matching complete cases
comp %>% distinct(filename,day,url,duration) %>% left_join(incomp,.)

  filename day url duration.old duration
1 day2.csv   2   i           NA       22
2 day6.csv   6   j           NA       26
3 day6.csv   6   j           NA       12
4 day6.csv   6   j           NA       16

You can then write out the newly 'completed' files one-by-one using the d_ply approach below.

# data used (d_ply comes from the plyr package)
library(plyr)
set.seed(123)
df <- data.frame(day=sample(1:10,100,T),
                 url=sample(letters[1:10],100,T),
                 duration=sample(c(10:30,NA),100,T))
d_ply(df,.(day),function(x) write_csv(x,paste0('day',x %>% distinct(day),'.csv')))
kstew
  • Thanks, I think this approach might work (as will @Gautam's - there's more than one way to skin a cat, as they say!) I might need to modify it slightly though - my post probably could have been clearer about the missing values from the `Programme_Duration` column. The values in this column aren't present in every row; in fact they appear quite sporadically in the "complete" files. I have a feeling your approach (which searches for any instance of missing values) therefore might need to be modified somehow. Hope that makes sense? Either way, I might not get a chance to try this until next week.. – Japes Aug 23 '19 at 19:29
  • On second thoughts, I think I've just gotten my head around your approach. You're suggesting that I should search for all missing values, then search to see if those values exist elsewhere, and, if they do, insert them into the missing rows? Incidentally, it's quite easy to have the complete and incomplete files saved to different directories if this makes the whole process easier? – Japes Aug 23 '19 at 19:43
  • Hi @Japes, I think yes to your first question. As I understand the issue in your OP, you simply need to replace missing values in certain columns with values that are already present elsewhere, matching on some desired criteria (i.e., if the URL's are the same, replace NA with that URL's programme duration). If that's the case, you don't necessarily need to first sort your files into complete/incomplete. Once you have replaced the missing values, then you can proceed with the rest of your workflow. – kstew Aug 23 '19 at 20:03
  • In my mind, first approaching the files as complete/incomplete and labeling them so is an unnecessary step. Unless I am misunderstanding your expected output (e.g., over-writing the incomplete files after you have replaced the missing values)? – kstew Aug 23 '19 at 20:05
  • yep, you've nailed exactly what I'm trying to achieve. I don't have a background in programming (I'm a TV scholar!) so I'll bow to your better judgement here. I assumed that separating the complete/incomplete files into different directories would help, but I can see now that it'll just add an extra step. Thanks for the clarifying. I'll give this a go as soon as I get a chance and will let you know how I get on – Japes Aug 23 '19 at 20:10
  • @Japes if you read in all files, you could also just do `big_data[, .(mean(B), mean(C), mean(D), mean(E), mean(F), mean(G)), by = .(A) ]` assuming `A` is unique and other columns might have more than one value. it doesn't have to be the same for each column, some (or all) could be set to `sum` instead or you could use a custom function too (that returns the frequency of the most repeated value). The options really are up to you with this kind of approach. – Gautam Aug 23 '19 at 21:06
  • @kstew Well, I've tried (for several hours) and failed. Probably because I'm a complete novice with these things. I made several adjustments to your code (e.g. the column in question is actually called `Programme_Duration` and contains some characters). Even having done that, I'm not having success with the penultimate step: replace missing duration based on matching complete cases. For one thing, there is no column called `day`, which makes the final writing out step a bit too confusing for me! I appreciate all the help, but I think this task is a bit beyond me. – Japes Aug 27 '19 at 11:15