Current performance-related issues:

- reading each input file 5x => we want to limit this to a single read per input file
- calling `date` 5x (necessary) for each input file (unnecessary) => make the 5x `date` calls prior to the `for k in *.csv` loop [NOTE: the overhead of repeated `date` calls will pale in comparison to the repeated reads of the input files]
Potential operational issue:

`sed` is not designed for doing comparisons of data (eg, look for a string that is >= a search pattern); consider an input file like this:

$ cat input.csv
2021-01-25
2021-03-01

If 'today' is 2021-03-14 then for the `1month` dataset the current `sed` solution is:
sed -n '/2021-02/,$p'

But because there are no entries for 2021-02 the `sed` command returns 0 rows, even though we should see the row for 2021-03-01.
Granted, for this particular question we're looking for dates based on the month, and the application likely generated at least one row on a monthly basis, so this likely won't be a problem here; but we need to be aware of this limitation in general.
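A quick demo of the difference using the two-line input.csv from above; the `awk` string comparison (which the proposed solution below relies on) still finds the 2021-03-01 row:

$ sed -n '/2021-02/,$p' input.csv          # range never starts => 0 rows

$ awk -F',' '$1 >= "2021-02"' input.csv    # string comparison => row found
2021-03-01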
Anyhoo, back to the question at hand ...
Assumptions:

- input files are comma-delimited (otherwise need to adjust the proposed solution)
- the date to be tested is of the format `YYYY-MM-...`
- the date to be tested is the 1st field of the comma-delimited input file (otherwise need to adjust the proposed solution)
- output filename prefix is the input filename sans the `.csv` extension
Sample input:
$ cat input.csv
2019-09-01,line 1
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
We only need to do the date calculations once, so we'll do this in `bash`, prior to OP's `for k in *.csv` loop:
# date as of writing this answer: 2021-10-12
$ yr2=$(date -d "2 year ago" '+%Y-%m')
$ yr1=$(date -d "1 year ago" '+%Y-%m')
$ mon6=$(date -d "6 month ago" '+%Y-%m')
$ mon3=$(date -d "3 month ago" '+%Y-%m')
$ mon1=$(date -d "1 month ago" '+%Y-%m')
$ typeset -p yr2 yr1 mon6 mon3 mon1
declare -- yr2="2019-10"
declare -- yr1="2020-10"
declare -- mon6="2021-04"
declare -- mon3="2021-07"
declare -- mon1="2021-09"
One `awk` idea (replaces all of the `sed` calls in OP's current `for k in *.csv` loop):
# determine prefix to be used for output files ...

$ k=input.csv
$ prefix="${k%.csv}"        # strip the ".csv" suffix
$ echo "${prefix}"
input
awk -v yr2="${yr2}" \
    -v yr1="${yr1}" \
    -v mon6="${mon6}" \
    -v mon3="${mon3}" \
    -v mon1="${mon1}" \
    -v prefix="${prefix}" \
    -F ',' '                                  # define input field delimiter as comma
{ split($1,arr,"-")                           # date to be compared is in field #1
  testdate=arr[1] "-" arr[2]                  # rebuild as YYYY-MM
  if ( testdate >= yr2  ) print $0 > (prefix".2years.csv")
  if ( testdate >= yr1  ) print $0 > (prefix".1year.csv")
  if ( testdate >= mon6 ) print $0 > (prefix".6months.csv")
  if ( testdate >= mon3 ) print $0 > (prefix".3months.csv")
  if ( testdate >= mon1 ) print $0 > (prefix".1month.csv")
}
' "${k}"
NOTE: `awk` can dynamically process the input filename to determine the filename prefix (see the `FILENAME` variable) but would still need to know the target directory name (assuming output files are written to a different directory from where the input file resides).
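For example, a minimal sketch (showing just the 2years output for brevity) that derives the prefix from `FILENAME` on the first line of each input file; this would also allow a single `awk` invocation to process all of the input files:

awk -v yr2="${yr2}" -F ',' '
FNR==1 { prefix=FILENAME                      # eg, "input.csv"
         sub(/\.csv$/,"",prefix)              # strip ".csv" suffix => "input"
       }
       { split($1,arr,"-")
         testdate=arr[1] "-" arr[2]
         if ( testdate >= yr2 ) print $0 > (prefix".2years.csv")
       }
' *.csv

With a large number of input files the finished output files may need to be close()'d along the way to avoid running out of file descriptors (GNU awk works around the per-process limit; other awks may not).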
This generates the following files:
for f in "${prefix}".*.csv
do
echo "############# ${f}"
cat "${f}"
echo ""
done
############# input.2years.csv
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1year.csv
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.6months.csv
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.3months.csv
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1month.csv
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
Additional performance improvements:

- [especially for largish files] read from one filesystem, write to a 2nd/different filesystem; even better would be a separate filesystem for each of the 5x different output files (this would require a minor tweak to the proposed `awk` solution)
- process X number of input files in parallel, eg, the `awk` code could be placed in a `bash` function and then called via `<function_name> <input_file> &`; this can be done via `bash` loop controls, `parallel`, `xargs`, etc (see the sketch after this list)
- if running parallel operations, limit the number of parallel operations based primarily on disk subsystem throughput, ie, how many concurrent reads/writes the disk subsystem can handle before slowing down due to read/write contention
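A minimal sketch of the parallel idea using a `bash` function plus GNU `xargs`; the function name `split_by_date` and the job count of 4 are illustrative placeholders:

# assumes yr2/yr1/mon6/mon3/mon1 have already been set as above

split_by_date() {
    local k="$1"
    local prefix="${k%.csv}"

    awk -v yr2="${yr2}"   -v yr1="${yr1}" \
        -v mon6="${mon6}" -v mon3="${mon3}" -v mon1="${mon1}" \
        -v prefix="${prefix}" -F ',' '
        { split($1,arr,"-")
          testdate=arr[1] "-" arr[2]
          if ( testdate >= yr2  ) print $0 > (prefix".2years.csv")
          if ( testdate >= yr1  ) print $0 > (prefix".1year.csv")
          if ( testdate >= mon6 ) print $0 > (prefix".6months.csv")
          if ( testdate >= mon3 ) print $0 > (prefix".3months.csv")
          if ( testdate >= mon1 ) print $0 > (prefix".1month.csv")
        }
        ' "${k}"
}

export -f split_by_date                  # make function visible to child shells
export yr2 yr1 mon6 mon3 mon1            # ditto for the date variables

# process up to 4 input files at a time; tune -P to the disk subsystem
printf '%s\0' *.csv | xargs -0 -n 1 -P 4 bash -c 'split_by_date "$1"' _

Note the output files also end in `.csv`, so a subsequent run against `*.csv` would pick them up as input; writing the output files to a different directory (per the earlier NOTE) sidesteps this.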