
I am trying to quickly find all folders named in a yyyymmdd_hhmmss format between two dates and times. These dates and times are variables set on user input.

E.g., all folders between

20221231_120000
20230101_235920

All dates/times looked for being valid is not a requirement for me.

Note that the 'age' of the folders does not match their names.


I have looked at regex but it seems like a complex solution for variable dates/times.

I have looked at Ansible find module patterns but they are incredibly slow, because it runs the find command for every sequential number, taking about 1 second per checked number.

For example:

  - name: Find folders matching dates and times
    vars:
      startdate: "20230209"
      enddate: "20230209"
      starttime: "120000"
      endtime: "130000"
    ansible.builtin.find:
      paths:
        - "/folderstocheck/"
      file_type: directory
      patterns: "{{ item[0:8] }}_{{ item[8:] }}"
    with_sequence: start={{ startdate + starttime }} end={{ enddate + endtime }}
    register: found_files

This takes approximately 167 minutes to run.

gemenerik
  • "_All dates looked for being valid is not a requirement for me._", I understand that there can be values like `20220031_120000` or `20220231_120000` and which is OK, so one could look more on it like a "version sort". How about the time? Is it `120000` for every folder or can there be values from `000000` to `235959`? – U880D Feb 10 '23 at 10:59
  • why does it have to be quickly? why did you 'bold' the text? – Kevin C Feb 10 '23 at 12:20
  • The folders are named by any real date and real time. I modified the example to clarify. I do not care if the algorithm would return folders of invalid dates and times, because these folders do not exist. This might indeed simplify the solution. I specify that I want it to be quick, because I already have found a slow solution. – gemenerik Feb 10 '23 at 12:54
  • Thanks for specifying the search pattern. Input data, meaning a longer, detailed list of available folder names, would also help to provide the solution you are looking for. Otherwise one has to guess what the input could look like. "_I specify that I want it to be quick, because I already have found a slow solution._", can you update the question and show us your solution? Also, please provide more details on what slow means. Since almost no details are provided, what should the result look like? What do you want to do with it later? That will have an impact on any proposed solution. – U880D Feb 10 '23 at 13:00
  • My original solution was to use the pattern with sequence as in the question; I will extend that solution a little bit. I want to use the results for further filtering and eventually synchronize the resulting folders. – gemenerik Feb 10 '23 at 14:52

1 Answer


Regarding

Note that the 'age' of the folders does not match their names.

I recommend aligning the folders' access and modification times with their names, so that simple OS functions or Ansible modules like stat can be used. That would make any processing a lot easier.

How to do that? I have a somewhat similar use case, "Change creation time of files (RPM) from download time to build time", which shows the idea and how it can be achieved.
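
As a rough sketch of that idea, the file module can set both timestamps from the name itself. This assumes a list of directory names in a variable such as dir_list (built further down in this answer), the same base path as in the playbook below, and that every name is a valid timestamp: names like 20221232_000000 cannot be parsed as a date and would make the task fail.

```yaml
# Sketch only: dir_list is assumed to hold valid 'yyyymmdd_hhmmss' names.
- name: Align access and modification times with the directory names
  ansible.builtin.file:
    path: "/home/{{ ansible_user }}/test/{{ item }}"
    state: directory
    # Both *_format options take a strptime-style format string.
    modification_time: "{{ item }}"
    modification_time_format: "%Y%m%d_%H%M%S"
    access_time: "{{ item }}"
    access_time_format: "%Y%m%d_%H%M%S"
  loop: "{{ dir_list }}"
```

Note that this still loops once per directory, so it is better suited as a one-time preparation step than as part of every run.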


Given some test directories as input

:~/test$ tree 202*
20221231_110000
20221231_120000
20221231_130000
20221232_000000
20230000_000000
20230101_000000
20230101_010000
20230101_020000
20230101_030000
20230101_120000
20230101_130000

a minimal example playbook

---
- hosts: localhost
  become: false
  gather_facts: false

  vars:

    FROM: "20221231_120000"
    TO: "20230101_120000"

  tasks:

  - name: Get an unordered list of directories with pattern 'yyyymmdd_hhmmss'
    find:
      path: "/home/{{ ansible_user }}/test/"
      file_type: directory
      use_regex: true
      patterns: "^[1-2]{1}[0-9]{7}_[0-9]{6}" # can be more specified
    register: result

  - name: Order list
    set_fact:
      dir_list: "{{ result.files | map(attribute='path') | map('basename') | community.general.version_sort }}"

  - name: Show directories between
    debug:
      msg: "{{ item }}"
    when: item is version(FROM, '>=') and item is version(TO, '<=') # means between
    loop: "{{ dir_list }}"

results in the following output

TASK [Get an unordered list of directories with pattern 'yyyymmdd_hhmmss'] *****
ok: [localhost]

TASK [Order list] **************************
ok: [localhost]

TASK [Show directories between] ************
ok: [localhost] => (item=20221231_120000) =>
  msg: '20221231_120000'
ok: [localhost] => (item=20221231_130000) =>
  msg: '20221231_130000'
ok: [localhost] => (item=20221232_000000) =>
  msg: '20221232_000000'
ok: [localhost] => (item=20230000_000000) =>
  msg: '20230000_000000'
ok: [localhost] => (item=20230101_000000) =>
  msg: '20230101_000000'
ok: [localhost] => (item=20230101_010000) =>
  msg: '20230101_010000'
ok: [localhost] => (item=20230101_020000) =>
  msg: '20230101_020000'
ok: [localhost] => (item=20230101_030000) =>
  msg: '20230101_030000'
ok: [localhost] => (item=20230101_120000) =>
  msg: '20230101_120000'

Some measurements

Get an unordered list of directories with pattern 'yyyymmdd_hhmmss' -- 0.50s
Show directories between --------------------------------------------- 0.24s
Order list ----------------------------------------------------------- 0.09s

According to the initial description, no time zone or daylight saving time is involved. So this works because the given pattern is effectively just an incrementing number, even if a human may interpret it as a date. It could even be simplified if more information about the time part were provided. That is, if the time is always 120000, that insignificant part could be dropped, leaving a simple integer. The same is true for the delimiter _.
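
Because every name has the same width and is zero-padded, a plain string comparison gives the same ordering as the numeric one. As a minimal sketch, assuming dir_list, FROM and TO from the playbook above (dirs_between is just an illustrative variable name), the whole range can be selected in one expression instead of a loop:

```yaml
# Sketch only: assumes dir_list, FROM and TO as defined in the playbook above.
- name: Select directories between FROM and TO in a single expression
  ansible.builtin.set_fact:
    # 'ge' and 'le' are Jinja2 built-in comparison tests (Jinja2 >= 2.10);
    # on strings they compare character by character.
    dirs_between: "{{ dir_list | select('ge', FROM) | select('le', TO) | list }}"

- name: Show the selected directories in one task
  ansible.builtin.debug:
    var: dirs_between
```

This only holds as long as all names follow the fixed-width pattern; names of differing length would need the version or integer comparison instead.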


Regarding

... they are incredibly slow, because it runs the find command for every sequential number ... with_sequence ...

that is not necessary, and it looks to me like a case of "How do I optimize performance of Ansible playbook with regards to SSH connections?"

Looping over a command and passing one parameter per run causes a lot of overhead, including multiple SSH connections. Passing the whole list to the command in one call, where that is possible, can increase performance and reduce runtime and resource consumption.

Further processing can be done just afterwards.

U880D