
I'm working on looping over hundreds of thousands of CSV files to generate more files from them. The requirement is to extract the previous 1 month, 3 months, 6 months, 1 year & 2 years of data from every file & generate new files from them.

I've written the below script which gets the job done but is super slow. This script will need to be run quite frequently, which makes the slowness a real pain. Is there a better way to achieve the outcome I'm after, or possibly a way to enhance the performance of this script, please?

for k in *.csv; do
    sed -n '/'"$(date -d "2 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.2years.csv"
    sed -n '/'"$(date -d "1 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1year.csv"
    sed -n '/'"$(date -d "6 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.6months.csv"
    sed -n '/'"$(date -d "3 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.3months.csv"
    sed -n '/'"$(date -d "1 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1month.csv"
done
usert4jju7
  • If performance is a concern, you shouldn't be using `bash`. Use a real programming language. – Barmar Oct 12 '21 at 08:34
  • @ceving He's not doing `grep`. He's printing everything from the matching line to the end of the file. – Barmar Oct 12 '21 at 08:37
  • Are these files large? Reading each file 5 times is clearly not the best way to do it. In a language like Python you could read it once and write to all 5 output files as you go. – Barmar Oct 12 '21 at 08:38
  • You could also use `awk` for this, it can write multiple files in a single script. – Barmar Oct 12 '21 at 08:39
  • Hello Barmar. Very useful comments. Thank you. Yes the files are fairly large going into a few GB sometimes. – usert4jju7 Oct 12 '21 at 08:43
  • Or you can use Perl, its [Text::CSV_XS](http://p3rl.org/Text::CSV_XS) is very fast and abundantly capable. – choroba Oct 12 '21 at 08:44
  • `date -d "2 year ago" '+%Y-%m'` you are calling the same process with the same output for each file. – KamilCuk Oct 12 '21 at 09:17
  • Make it parallel, e.g. [like this](https://stackoverflow.com/a/69452918/8584929) or [like this](https://stackoverflow.com/a/69209835/8584929). – Andrej Podzimek Oct 12 '21 at 15:35
  • #1 way to ensure performance in a shell script is to not spawn subshells within a loop. Whenever you have `...; do utility1, utility2, utility3 ... done` every utility called within the loop requires its own separate subshell. This generally means you need to rethink the tool you have chosen for the job. `awk` can likely handle the entire operation in a single call and will be orders of magnitude faster than a loop in a shell script that spawns multiple subshells. – David C. Rankin Oct 15 '21 at 00:51

3 Answers


You read each CSV five times. It would be better to read each CSV only once.

You extract the same data multiple times. All but one of the extracts are subsets of another:

  • the last 1 month is a subset of the last 3 months, 6 months, 1 year and 2 years.
  • the last 3 months are a subset of the last 6 months, 1 year and 2 years.
  • the last 6 months are a subset of the last 1 year and 2 years.
  • the last 1 year is a subset of the last 2 years.

This means every line in "1month.csv" is also in "3months.csv". So it is sufficient to extract "1month.csv" from "3months.csv". You can cascade the searches, from the widest range to the narrowest, with tee.

The following assumes that the contents of your files are ordered chronologically. (I simplified the quoting a bit.)

sed -n "/$(date -d '2 year ago' '+%Y-%m')/,\$p" "${k}" |
tee "temp_data_store/${k}.2years.csv" |
sed -n "/$(date -d '1 year ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.1year.csv" |
sed -n "/$(date -d '6 month ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.6months.csv" |
sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.3months.csv" |
sed -n "/$(date -d '1 month ago' '+%Y-%m')/,\$p" > "temp_data_store/${k}.1month.csv"
ceving
  • Thanks a ton Ceving. A few questions - "This means every line in "2years.csv" is also in "1year.csv". So it will be sufficient to extract "2years.csv" from "1year.csv." - Is it not the other way round? 2 years has every line from 1 year, which has every line from 6 months, which has every line from 3 months & so on? – usert4jju7 Oct 12 '21 at 11:11
  • Also, when we are using tee, do we still need ${k} in the command sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" "${k}" ? Or will piping the output of one command into the next be good enough? – usert4jju7 Oct 12 '21 at 11:13
  • The point was to remove reading the same file 5 times, isn't it? – usert4jju7 Oct 12 '21 at 11:13
  • @usert4jju7 You are right. I have removed it. Copy paste error. – ceving Oct 14 '21 at 09:35
  • need to reverse the order of the `sed` calls; as currently written the first pulls data for the last 1 month; this same 1-month old data is then piped to the follow on `sed` commands but they find nothing older than 1 month; it's actually worse because the patterns for the older time periods find *no* matches so the last 4x files are empty – markp-fuso Oct 14 '21 at 23:49
  • Thanks a ton for getting me thinking in right direction. While this was helpful, I chose to go with the solution proposed by Ole for ease of use, maintainability & readability. I'll use your solution elsewhere. I wish I could accept all the answers – usert4jju7 Oct 15 '21 at 07:46

Current performance-related issues:

  • reading each input file 5 times => we want to limit this to a single read per input file
  • calling date 5 times (necessary) for each input file (unnecessary) => make the 5 date calls once, prior to the for k in *.csv loop [NOTE: the overhead of the repeated date calls will pale in comparison to the repeated reads of the input files]

Potential operational issue:

sed is not designed for doing comparisons of data (eg, look for a string that is >= a search pattern); consider an input file such as:

$ cat input.csv
2021-01-25
2021-03-01

If 'today' is 2021-03-14 then for the 1month dataset the current sed solution is:

sed -n '/2021-02/,$p'

But because there are no entries for 2021-02 the sed command returns 0 rows, even though we should see the row for 2021-03-01.

Granted, for this particular question we're looking for dates based on the month, and the application likely generated at least one row every month, so this likely won't be a problem here; but we need to be aware of this behavior in general.
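As a quick illustration (a sketch; demo.csv is a made-up file), a string comparison in awk handles the gap that the sed range match misses, because YYYY-MM-DD strings sort lexicographically in chronological order:

```shell
# demo.csv has no rows at all for 2021-02
printf '2021-01-25\n2021-03-01\n' > demo.csv

# sed range: the start pattern /2021-02/ never matches, so nothing is printed
sed -n '/2021-02/,$p' demo.csv

# awk comparison on the YYYY-MM prefix: prints the 2021-03-01 row
awk -F',' 'substr($1,1,7) >= "2021-02"' demo.csv
```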

Anyhoo, back to the question at hand ...


Assumptions:

  • input files are comma-delimited (otherwise need to adjust the proposed solution)
  • the date to be tested is of the format YYYY-MM-...
  • the date to be tested is the 1st field of the comma-delimited input file (otherwise need to adjust the proposed solution)
  • output filename prefix is the input filename sans the .csv

Sample input:

$ cat input.csv
2019-09-01,line 1
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

We only need to do the date calculations once so we'll do this in bash and prior to OP's for k in *.csv loop:

# date as of writing this answer: 2021-10-12

$ yr2=$(date -d "2 year ago" '+%Y-%m')
$ yr1=$(date -d "1 year ago" '+%Y-%m')
$ mon6=$(date -d "6 month ago" '+%Y-%m')
$ mon3=$(date -d "3 month ago" '+%Y-%m')
$ mon1=$(date -d "1 month ago" '+%Y-%m')

$ typeset -p yr2 yr1 mon6 mon3 mon1
declare -- yr2="2019-10"
declare -- yr1="2020-10"
declare -- mon6="2021-04"
declare -- mon3="2021-07"
declare -- mon1="2021-09"

One awk idea (replaces all of the sed calls in OP's current for k in *.csv loop):

# determine prefix to be used for output files ...

$ k=input.csv

$ prefix="${k//.csv/}"
$ echo "${prefix}"
input

awk -v yr2="${yr2}"       \
    -v yr1="${yr1}"       \
    -v mon6="${mon6}"     \
    -v mon3="${mon3}"     \
    -v mon1="${mon1}"     \
    -v prefix="${prefix}" \
    -F ',' '                                # define input field delimiter as comma

{ split($1,arr,"-")                         # date to be compared is in field #1
  testdate=arr[1] "-" arr[2]

  if ( testdate >= yr2  ) print $0 > prefix".2years.csv"
  if ( testdate >= yr1  ) print $0 > prefix".1year.csv"
  if ( testdate >= mon6 ) print $0 > prefix".6months.csv"
  if ( testdate >= mon3 ) print $0 > prefix".3months.csv"
  if ( testdate >= mon1 ) print $0 > prefix".1month.csv"
}
' "${k}"

NOTE: awk can dynamically process the input filename to determine the filename prefix (see the FILENAME variable) but would still need to know the target directory name (assuming we're writing to a different directory from where the input file resides)
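A minimal sketch of that idea (the temp_data_store directory name is assumed, as elsewhere in this answer):

```shell
# derive the output-file prefix from awk's built-in FILENAME variable
printf '2021-09-01,line A\n' > sample.csv

awk -F',' '
FNR == 1 {                                # once at the start of each input file
    prefix = FILENAME
    sub(/\.csv$/, "", prefix)             # strip the .csv extension
    prefix = "temp_data_store/" prefix    # target directory still supplied by hand
}
{ print prefix }                          # the real script would write to prefix".1month.csv" etc.
' sample.csv
# prints: temp_data_store/sample
```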

This generates the following files:

for f in "${prefix}".*.csv
do
    echo "############# ${f}"
    cat "${f}"
    echo ""
done

############# input.2years.csv
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

############# input.1year.csv
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

############# input.6months.csv
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

############# input.3months.csv
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

############# input.1month.csv
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

Additional performance improvements:

  • [especially for largish files] read from one filesystem, write to a 2nd/different filesystem; even better would be a separate filesystem for each of the 5x different output files - would require a minor tweak to the awk solution
  • processing X number of input files in parallel, eg, awk code could be placed in a bash function and then called via <function_name> <input_file> &; this can be done via bash loop controls, parallel, xargs, etc
  • if running parallel operations, will need to limit the number of parallel operations based primarily on disk subsystem throughput, ie, how many concurrent reads/writes can disk subsystem handle before slowing down due to read/write contention
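The second bullet can be sketched with xargs and a bash function (the worker name process_one is made up; -P 4 caps the number of concurrent jobs at 4):

```shell
# run a per-file worker in parallel, at most 4 files at a time
process_one() {
    # stand-in for the awk command above; here it just reports the file
    echo "processing $1"
}
export -f process_one

# NUL delimiters keep filenames with spaces intact
printf '%s\0' a.csv b.csv |
    xargs -0 -n 1 -P 4 bash -c 'process_one "$1"' _
```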
markp-fuso
  • Thanks a ton for such detailed info @markp-fuso. I'll need to read this a few times to digest. I'll report back once I've tried – usert4jju7 Oct 14 '21 at 21:40
  • Thanks a ton for such great info. While this was helpful, I chose to go with the solution proposed by Ole for ease of use, maintainability & readability. I'll use your solution elsewhere. I wish I could accept all the answers – usert4jju7 Oct 15 '21 at 07:45

Something like this may work if you have GNU Parallel (version >= 20191022):

do_one() {
  cat "$1" | parallel --pipe --tee sed -n '/$(date -d {}" ago" '+%Y-%m')/,\$p' "> temp_data_store/$1.{}.csv" \
    ::: "2 year" "1 year" "6 month" "3 month" "1 month"
}
export -f do_one
parallel -j1 do_one ::: *.csv

If your disks are fast: Remove -j1.

Ole Tange