
So I have 1300 csv-files of financial data with the following format:

          Date     Open     High      Low    Close
1 Nov 28, 2017 0.233394 0.234871 0.223832 0.225542 
2 Nov 27, 2017 0.225910 0.234219 0.212298 0.233113
3 Nov 26, 2017 0.229367 0.235126 0.215153 0.226367
4 Nov 25, 2017 0.234212 0.239257 0.223383 0.228617
5 Nov 24, 2017 0.215836 0.236280 0.209834 0.234195
6 Nov 23, 2017 0.228887 0.232974 0.214334 0.216585

My goal is to extract the column "Open" from each file and bind the columns together so it will look something like this:

     Date     "File1"   "File2"  "File3"  ... "File 1300"         
1 Nov 28, 2017 0.233394 0.234871 0.223832 ... 0.225542 
2 Nov 27, 2017 0.225910 0.234219 0.212298 ... 0.233117 
3 Nov 26, 2017 0.229367 0.235126 0.215153 ... 0.226367 
4 Nov 25, 2017 0.234212 NA       0.223383 ... 0.228617 
5 Nov 24, 2017 0.215836 NA       0.209834 ... 0.234195 
6 Nov 23, 2017 0.228887 NA       0.214334 ... NA

I know there have been many questions here already on how to extract columns from multiple csv-files and how to bind them together, and I am almost there, but the problem is that the lengths of the files differ because the observations are based on dates. For example, one file can have observations starting Dec 3, 2014, while another starts Nov 23, 2017. Otherwise, the files are identical in format, and they all have their last observation on Nov 28, 2017.

Below you can see my code so far

# Get a List of all files in directory 
filenames <- list.files(".../path", pattern="*.csv", full.names=F)

# Loading column "Open" from each file 
for(i in filenames){   
    filepath <- file.path(".../path", paste(i,sep=""))   
    assign(i, read.csv(filepath, header=T, sep = ";", 
        colClasses=c(NA, NA, "NULL", "NULL", "NULL"))) 
}

#making a list of all data frames
df_list <- lapply(ls(), function(x) if (class(get(x)) == "data.frame") get(x)) 

#merging  
library(dplyr) 
res2 <- Reduce(function(...) left_join(..., by=c("Date")), df_list)

As you can see, I load all the csv-files into the global environment and turn them into a list, where I then want to merge the columns into one data frame. The problem seems to be the merging part, where neither Reduce nor the dplyr approach solves the problem.

So my question is: do you have any solutions for column-binding all the "Open" columns into one data frame, matched and sorted by date? And also, is there a quick fix to make the filenames the headers of the columns?

I am also unsure whether R is the best way to solve this. I am not that familiar with Python, but if you think it is easier, I can try to use that instead.

hl0nd

3 Answers


This is untested. Here are two approaches to the import. The first is iterative with a for loop, while the second imports all the data and then uses Reduce to merge the second object into the first, the third into the merge of the first two, the fourth into the merge of the first three, and so on.

# approach #1
result <- read.csv(filenames[1], header = TRUE, sep = ";")[, c("Date", "Open")]
colnames(result) <- c("Date", basename(filenames[1]))

for (i in filenames[-1]) {
  out <- read.csv(i, header = TRUE, sep = ";")[, c("Date", "Open")]
  colnames(out) <- c("Date", basename(i))
  result <- merge(result, out, by = "Date", all = TRUE)
}


# approach #2
alldata <- lapply(filenames, FUN = function(x) {
  out <- read.csv(x, header = TRUE, sep = ";")[, c("Date", "Open")]
  colnames(out) <- c("Date", basename(x))
  out
})

result <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "Date", all = TRUE), alldata)
Roman Luštrik
  • Thanks for two different alternatives. The first approach doesn't seem to iterate over all the files as it only outputs "out" as the data from the second file. I think the second approach looks faster, but the merge function doesn't seem to recognize "date" as I get this error: " Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column", pretty much the same problem I got when trying my initial approach. – hl0nd May 20 '18 at 08:29
  • I guess one has to add a `return(out)` to the function in `lapply` of approach no. 2. – cryo111 May 20 '18 at 11:12
  • Thanks cryo111! This almost solved it. My only problem now is that the dates aren't in the correct order. I get, for example, all August 2017 dates, then all December, etc. I could just sort by date, but it would be even neater to get the right order from the start. – hl0nd May 20 '18 at 13:55
  • I don't see a date conversion anywhere, so I assume the rows are ordered by interpreting the date column as `character`. You can try `strptime` on the date column, like `strptime("Nov 28, 2017",format = "%b %d, %Y")`. You might have to set the right locale for the abbreviated month name. – cryo111 May 20 '18 at 15:01
  • @hl0nd merge should take care of the ordering. – Roman Luštrik May 20 '18 at 15:09
  • Thanks guys. The date column was a bit tricky, but converting it from factor to character and then using the lubridate package to convert it to a real date column solved the ordering part. – hl0nd May 21 '18 at 08:07

If you are also interested in a solution in Python, here you go. I just tested it and it seems to work:

First of all you need to read the csv-files into pandas, assuming that columns 0 and 1 are Date and Open; skipinitialspace=True just gets rid of whitespace after the delimiter. As we want a list containing the dataframes, we loop over the csv files:

import glob
import pandas as pd

dfs = []
for f in glob.glob('*.csv'):
    dfs.append(pd.read_csv(f, delimiter=';', usecols=[0, 1], skipinitialspace=True))

After reading all the files, you want to rename the Open columns so that each one is unique:

dfs = [x.rename(columns={'Open': 'File_{}'.format(i)}) for i, x in enumerate(dfs)]
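If you want the actual file names as headers (as asked in the question) rather than the positional File_0, File_1, ... names, one alternative is to derive each column name from its source path. The helper name_from_file below is hypothetical, not part of pandas; it is a minimal sketch assuming the column should be the file name without directory and extension:

```python
import os

# hypothetical helper: turn a path like './prices1.csv' into the column name 'prices1'
def name_from_file(path):
    return os.path.splitext(os.path.basename(path))[0]
```

You would then pair each frame with its path while looping and rename with, e.g., df.rename(columns={'Open': name_from_file(path)}).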

Now, courtesy of pandas joining multiple dataframes on columns, you can use reduce to create a final dataframe:

from functools import reduce

df_final = reduce(lambda left, right: pd.merge(left, right, on='Date', how='outer'), dfs)

This will create a final dataframe with one column per file and NaN wherever a file has no observation for a given date.
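One caveat, also discussed in the comments on the other answer: since Date is read as plain strings, the merged result will not be in chronological order. A minimal sketch of parsing and sorting, assuming dates formatted like "Nov 28, 2017" (the toy frame df stands in for the merged df_final):

```python
import pandas as pd

# toy frame standing in for the merged result; real Date strings look like "Nov 28, 2017"
df = pd.DataFrame({'Date': ['Nov 28, 2017', 'Nov 23, 2017', 'Nov 27, 2017']})

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
df['Date'] = pd.to_datetime(df['Date'], format='%b %d, %Y')
df = df.sort_values('Date').reset_index(drop=True)
```

Applying the same two lines to df_final gives chronologically ordered rows with a proper datetime column.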

eoxxs
  • I just realised that the suffix of the new column depends on the position in the list, so you may want to check that the files are put into the list in order. In my case they are... – eoxxs May 20 '18 at 09:24

Here is another approach, using a time series library (zoo).

First I am going to prepare some data similar to yours:

library(quantmod)

getSymbols(c("MSFT","FB"),src='yahoo',from='2005-01-01')
MSFT=data.frame(Dt=strftime(time(MSFT),format = "%b %d, %Y"),
                Close=unname(MSFT[,"MSFT.Close"]))
rownames(MSFT)=NULL
FB=data.frame(Dt=strftime(time(FB),format = "%b %d, %Y"),
              Close=unname(FB[,"FB.Close"]))
rownames(FB)=NULL
write.csv(MSFT,file="MSFT.csv",row.names = FALSE)
write.csv(FB,file="FB.csv",row.names = FALSE)

Now to the actual answer of your question:

#answer to original question
library(zoo)

filenames=c("MSFT.csv","FB.csv")

finDataList=lapply(filenames,function(x) 
    read.csv.zoo(x,format="%b %d, %Y",drop=FALSE))

finData=do.call("merge",finDataList)
colnames(finData)=gsub("\\.csv","",filenames)
head(finData)
#             MSFT FB
# 2005-01-03 26.74 NA
# 2005-01-04 26.84 NA
# 2005-01-05 26.78 NA
# 2005-01-06 26.75 NA
# 2005-01-07 26.67 NA
# 2005-01-10 26.80 NA
cryo111