0

I have a table of reference stock symbol (20,000 rows)

enter image description here

and a folder of csv files, each CSV files is named by a stock symbol, for example ZTS.csv. Inside each csv file, is the price history of the symbol.

enter image description here

The end goal is to track performances of all stocks and visualisation. Because of the sheer size of the reference table and the csv files, I think the most sensible approach will be selecting need information from each CSV files and add them into the reference table.

For example, I would want to take a row from the reference table, symbol ZTS, showdate 2017-01-09,

Then read the ZTE.csv file, find the rows with date matching the showdates, add the open/high/low/close price data columns

Then loop this.

Due to size restrictions, I have uploaded sample data here on google drive: https://drive.google.com/drive/folders/1G3os67b2i2VfGHnvR6NX8qk1ECuVawGJ?usp=sharing

#read in the reference data

df <- read.csv("reference table.csv", header = TRUE) 


# get csv files directory and list all files in this directory

wd <- "/Users/m/Desktop/project/price_data_csv"
files_in_wd <- list.files(wd)


#find stuff to match

# create an empty list and read in all files from wd
mylist <- list()
for(i in seq_along(files_in_wd)){
  
  mylist[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ',',
                             header = T)
}

I'm stuck on how to do the matching and creating combined table. Thank you

ML33M
  • 341
  • 2
  • 19
  • When dealing with a directory of CSV files, I find it is usually good to read them in as a [list of tables](https://stackoverflow.com/a/24376207/3358227) (typically using `lapply` or similar). Once in that format, considering your data already has unique-ids in each frame (`symbol` and `date`), then you could even combine them into a single frame (`do.call(rbind, list_of_frames)` or `dplyr::bind_rows(lof)` or `data.table::rbindlist(lof)`) and deal with the data with [grouping](https://stackoverflow.com/q/11562656/3358272) and similar efforts. – r2evans Sep 30 '21 at 18:12
  • While base R can do most grouping just fine, many find it may be easier (or at least more readable) to use `dplyr` for its `group_by`; alternatively, `data.table` does grouping options very well, though the `data.table` learning curve may be a little steeper than `dplyr`'s (with good results, though). – r2evans Sep 30 '21 at 18:13
  • @r2evans yes, that was what I aim to achieve. Sorry my knowledge in R is still very shallow. I'm not sure if I can read all the CSV in, there are about 3000 files of various sizes. That's why I was trying to read the reference table, find stock tickers and time, then read in one csv corresponding to that ticker, use time to match price data, then this matched dataframe will be stored as an element in a list. In the end of the day, I will have a list of wanted data before I can do call it – ML33M Sep 30 '21 at 18:19
  • @r2evans I dont know how to do that though... or in your way. Can you show me using the example data I provided in the google drive link? – ML33M Sep 30 '21 at 18:19

1 Answers1

1

I'd recommend using data.table, because, as @r2evans mentions, it does grouping well, and if your data is large, it is very fast.

Using your sample data, the could bekow should hopefully get you started (I've prefixed the data.table methods to help indicate where it's being used). You can use the provided function on an individual symbol, or try running it all at once (not sure how big your data actually is).

library(data.table)

data_dir <- "~/Downloads/Testing/"
reference_table <- data.table::fread(paste0(data_dir, "reference table.csv"));

prepare_symbol_table <- function(sym, ref) {
  # This check is only necessary if calling individually
  if(data.table::uniqueN(ref$symbol) > 1) 
    ref <- ref[symbol == sym]

  symbol_csv <- data.table::fread(paste0(data_dir, sym, ".csv"))
  data.table::merge.data.table(ref, symbol_csv, by.x = c("showdate"), by.y = c("date"))  
}

# merge a single symbol table
yum_table <- prepare_symbol_table("YUM", reference_table)

# all merged at once, reading individual CSVs by matching the symbol column from
# the reference table
all_symbols_merged <- reference_table[, {
  # symbol_csv <- data.table::fread(paste0(data_dir, symbol, ".csv"))
  # data.table::merge.data.table(.SD, symbol_csv, by.x = c("showdate"), by.y = c("date"))
  prepare_symbol_table(.BY, .SD)
}, by = c("symbol")]
clmarquart
  • 4,721
  • 1
  • 27
  • 23
  • Thank you, this answer is very slick. I've learned alot about datatable just by playing your code. the by = c("symbol") feeds in the data.table::fread(paste0(data_dir, symbol, ".csv")). Like I never know this can work this way. And I dont even know you can pass a function in reference_table[,{ },] – ML33M Oct 01 '21 at 05:36
  • I have 2 errors need to overcome, 1) is some of the symbols in the reference table are not NYSE, so I cannot retrieve csv file for price. I overcame the problem by filtering the empty csv files and remove the symbols from the reference table. But then when I ran your code, second problem happend where datatable complained about Error in `[.data.table`(reference_table, , { : Column 11 of result for group 8 is type 'double' but expecting type 'integer'. Column types must be consistent for each group. – ML33M Oct 01 '21 at 06:29
  • I think the problem is some price is 43, and some are 32.65. Is there a way to coerce all of them in your code into double so dataable will be happy? In light of the data, some date might not have a price match, therefore returning NA, I guess datatable will complain again? – ML33M Oct 01 '21 at 06:33
  • update: Ive' tried to replace any NA value with none, and round all numerics and set to double. But the datatable error complains column being logic when expected character. Is there anyway we can coerce it to accept? – ML33M Oct 01 '21 at 09:07
  • You can try setting explicitly setting the column classes when calling `fread` using the `colClasses` argument: `symbol_csv <- data.table::fread(paste0(data_dir, sym, ".csv"), colClasses = list(numeric=1, character=2, Date=3, double=4:9))` – clmarquart Oct 01 '21 at 13:25
  • MAN! IT WORKED BEAUTIFULLY! My eyes are watering when I saw the final table! Thank you for the help and knowledge you sparked! – ML33M Oct 02 '21 at 11:54