1

I have column in Postgres db which has text in char varying data type. The text includes an uri which contains file name and resembles as below;

  The file is a file of \\88-77-99-666.abc.example.com\Folder1\Folder2\Folder3\Folder4\20221122\12345678.PDF [9bc8rer55c655f4cb5df763c61862d3fdde9557b0] is the sha1 of the file.

I am trying to get the file name 12345678.PDF and date 20221122 from the text content. However, regexp_replace either gives me everything till file name or everything after filename. I am trying to get only file name

1>> Regexp_replace(data, '.+\\', '')

Yields filename and everything after it

 2>> Regexp_replace(data, '\[.*', '')

Yields filename and everything after it

If I capture two groups like below I get same result as 1.

Regexp_replace(data, '.+\\|\[', '')

How can I substitute 2 groups and only get filename? Or what is the better way to achieve this? And I need to get the date value but if I can figure this out maybe I will be able to apply the learning for to extract date value. Thanks for your time.

Alsheik
  • 39
  • 6
  • You're running a replace function, so you'll need to capture the part that you want to keep and replace the rest of the string with it. Try something like ```Regexp_replace(data, '.+\\(.+)`.*', '\1')``` – CAustin Nov 23 '22 at 01:54
  • I tried it but I getting the full string back. I tried this `substring(data from '\w*.PDF')` which returns the desired results but if the extension is not PDF then I am not getting the result. I could use `\w*\.[aA-zZ] but the string has domain as `example.vpc.com` resulting in undesired result. Trying to figure out how to further qualify the substring to get extensions such as Pdf, pdf, DOC, doc and its likes – Alsheik Nov 23 '22 at 05:44
  • Maybe ``REGEXP_MATCHES(col, "`([^`]+)` *\[([^][]+)")``? – Wiktor Stribiżew Nov 23 '22 at 08:37
  • @WiktorStribiżew I tried your suggestion and getting null results – Alsheik Nov 23 '22 at 18:43
  • @Alsheik It [works here](https://dbfiddle.uk/DvfzeQCR). – Wiktor Stribiżew Nov 23 '22 at 19:03
  • Unfortunately, it did not work; I have scrubbed the original content for your test; I tried it [here](https://dbfiddle.uk/wV0relDd) with 0 results, additionally updating the question with actual text content as an aside, only looking for filename from the URI. Thanks for looking into this `SELECT REGEXP_MATCHES('The file is a file of \\88-77-99-666.abc.example.com\Folder1\Folder2\Folder3\Folder4\20221122\2779780.PDF [9bc8rer55c655f4cb5df763c61862d3fdde9557b0] is the sha1 of the file.', '`([^`]+)` *\[([^\]\[]+)') AS Result;` – Alsheik Nov 23 '22 at 21:12
  • @WiktorStribiżew pardon my ignorance what is the purpose of backtick ` \` ` character in your expression? – Alsheik Nov 23 '22 at 21:15
  • @Alsheik Your string - that you edited an hour ago - contained the backticks. If you have a char in your string, it can be matched. – Wiktor Stribiżew Nov 23 '22 at 22:15
  • Apologies, formatting error my end, did not realize the backtick existed in my original post. I made the change and I was getting only `{The}` which is the first word in the test content. Check [here](https://dbfiddle.uk/79OaOjx-). I'm definitely missing something – Alsheik Nov 24 '22 at 06:00
  • Maybe `'(\S+)\s+\[([^][]+)'`? – Wiktor Stribiżew Nov 24 '22 at 15:10
  • Thanks for the suggestion. It did not yield the desired result of grabbing only the `filename.pdf`, – Alsheik Nov 24 '22 at 19:51

1 Answers1

1

You can use

SELECT REGEXP_MATCHES(
  'The file is a file of \\88-77-99-666.abc.example.com\Folder1\Folder2\Folder3\Folder4\20221122\2779780.PDF [9bc8rer55c655f4cb5df763c61862d3fdde9557b0] is the sha1 of the file.',
  '([^[:space:]\\/]+)\s+\[([^][]+)') AS Result;

See the DB fiddle, result:

enter image description here

Details:

  • ([^[:space:]\\/]+) - Group 1: one or more chars other than \, / and whitespace
  • \s+ - one or more whitespaces
  • \[ - a [ char
  • ([^][]+) - Group 2: one or more chars other than [ and ].
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks @WiktorStribiżew, your answer got me what I wanted with slight change `([^[:space:]\\/]+)\s+\[` as it gets the required filename.pdf portion and I decided to go with `substring` vs `regex_matches` to avoid the curly braces, [SQLize](https://sqlize.online/sql/psql12/a508b45a48204991ec2ea61083292d22/). Again thanks for the help. Now on to figuring out how to extract the `date folder` before `\filename.pdf`will update the thread once I find the answer. – Alsheik Nov 26 '22 at 06:16
  • To find the date folder, '`\\([0-9]{8})\\`' gives me the desired results as its in YYYYMMDD. [SQLize](https://sqlize.online/sql/psql12/1e5a216991b0b5eff611d67c042b5d85/). Hope this helps someone. – Alsheik Nov 26 '22 at 07:04
  • Unable to get the 2xbackslashes in my previous comment, refer to SQLize link for answer – Alsheik Nov 26 '22 at 07:11