I am currently in the early stages of writing a multi-faceted investment algorithm. The part I am currently working on is concerned with using a Graphical Gaussian Model with a LASSO penalty to find inter-dependencies which can be used to inform investment strategy. I am currently trying to use JAVA to pre-process historical CSV data input and create a new CSV output file with the relevant data.
The raw, small-scale example data I am using to test the processing algorithm (which will eventually be used on a Reuters Eikon live feed) is in txt/CSV format. I have a folder containing text files with historical data on many stocks on the NYSE. Although there are 8 columns, the three I am interested in (for the purposes of pre-processing before creating a covariance matrix which will feed into 'GLASSO') are the Date, Time & Opening prices. The opening prices column requires no pre-processing, so that can be fed into a new, less noisy output file.
My issue is how to convert the two columns (date and time) into a single time measurement. I was thinking the most obvious way to do this would be to find the earliest point in time in my data and use this as point 0 (in seconds). I would then need to convert every time and date combination into a single column showing how many seconds it is past the original time point in the output CSV file. Once this was done rather than a file path specification I would like to be able to specify a folder and the program loop through all text files finding the relevant columns and output all into a single CSV file.
What this would hopefully look like in practice:
CSV title and first entry in one NYSE txt file -
"Date,Time,Open,High,Low,Close,Volume,OpenInt
2016-02-03,15:35:00,37.27,37.36,37.17,37.29,25274,0"
So essentially if the first entry is the earliest time reference:
2016-02-03,15:35:00 = '0'
2016-02-03,15:40:00 = '300' (5 minutes is 300 seconds)
Just to re-iterate, input is a folder containing hundreds of the following formatted CSVs:
Columns - 1: Date 2: Time 3: Open 4: High 5: Low 6: Close 7: Volume 8: OpenInt
Output is a single CSV file containing:
Columns - 1: Time measure (distance in seconds from earliest entry point) 2: Stock price for each time measure entry.
Please let me know if you have any clues about how I could go about doing this, don't hesitate to let me know if there is anything I can clarify to make your lives easier, I realise I could have maybe explained this in a less convoluted manner.