Replace string between subsequent delimiters with substring

Question

I have a 17 GB pipe-delimited .txt file and need to truncate any string longer than 10 characters between the 32nd and 33rd pipe to its first 10 characters, so that it can populate a database column. This has to happen without opening the file in Sublime Text, so it would need to be done through Java or Bash on AIX. On regex101.com I was trying to implement the ideas presented in the following post:

RegEx: Match nth occurence

but it doesn't limit the matched pattern only to my replacement-string.

Sample input:

|12210|IA||15||i956-743||||||l.4073||||a5015b3ed||l.464939|IC|||06 06:18:17||wireered||ENTITY|wirvered|2||||NoPodfoundorpoddoesnothaveedgetob-rd=l.415.63Z|REY||||RY|REY||

Intended output:

Change ...|NoPodfoundorpoddoesnot...|... to ...|NoPodfound|...

Full output string after replacement/truncation:

|12210|IA||15||i956-743||||||l.4073||||a5015b3ed||l.464939|IC|||06 06:18:17||wireered||ENTITY|wirvered|2||||NoPodfound|REY||||RY|REY||

Attempt at regex match:

^(?:[^|]*\|){32}[^|]+\| which matches everything from the start to the 33rd |, so |12210.......l.415.63Z|, but I want it to only match the string between pipes 32 and 33, specifically NoPodfoundorpoddoesnothaveedgetob-rd=l.415.63Z, for replacement purposes.

update 1; 10/18/17:

(^(?:[^|]*\|){32}[^|]{0,10})([^|]*)(\|.*$) group capture substitution with \1\3 provides the desired result. But this match must have a flaw since it seems to be capturing a non-capturing group (?:[^|]*\|).

update 2; 10/19/17:

I tried the following commands in a PuTTY session, but they do not edit the file:

cat subStrTest.txt
awk 'BEGIN{FS=OFS="|"}{$33=substr($33,1,10)} 1' subStrTest.txt

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html suggests that

string = substr(string,startIndex,numOfCharacters)

is valid syntax, at least for gawk, but I don't know whether the assignment

$33=substr($33,1,10)

is valid for fields referenced with $, such as $33, within awk.
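
One note on the awk attempt: awk writes its result to standard output, so the input file stays unchanged unless that output is redirected to a new file. For the Java route mentioned above, the same field-based idea might be sketched as follows. The class name `FieldTruncator`, the two command-line arguments, and the ISO-8859-1 charset are assumptions for illustration and are not part of the original post. The sketch streams the file line by line, so the 17 GB input is never held in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: truncates one pipe-delimited field per line.
final class FieldTruncator {

    // fieldNo is 1-based, like awk's $33; a line starting with '|' has an empty field 1.
    static String truncateField(String line, int fieldNo, int maxLen) {
        // Limit -1 keeps trailing empty fields, so untouched lines round-trip unchanged.
        String[] fields = line.split("\\|", -1);
        if (fieldNo <= fields.length && fields[fieldNo - 1].length() > maxLen) {
            fields[fieldNo - 1] = fields[fieldNo - 1].substring(0, maxLen);
        }
        return String.join("|", fields);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FieldTruncator <input> <output>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        // Assumption: ISO-8859-1 maps every byte to one char, so unknown encodings
        // pass through unchanged; "10 characters" then means 10 bytes.
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.ISO_8859_1);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(truncateField(line, 33, 10));
                writer.newLine(); // writes the platform line separator
            }
        }
    }
}
```

The output goes to a separate file rather than overwriting the input. The result can be checked and then moved into place.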

There's no flaw. The non-capturing group is included because it is nested inside a capturing group, so its text becomes part of that outer group's match. In other regex flavours you can do without the first capture group by using the `\K` token. Try `^(?:[^|]*\|){32}\K(([^|]{0,10})[^|]*)(?=\|)` on regex101. Unfortunately, Java doesn't support this token (as far as I'm aware) — ctwheels, Oct 17 '17 at 20:37
Yes, it's a job for awk. Guess it would be something like this: `awk 'BEGIN { FS="|"; OFS="|"; } { $33= substr ($33,1,10); print; }'` — Lorinczy Zsigmond, Oct 18 '17 at 06:03
Please review the answer to this question: https://stackoverflow.com/questions/46600250/removing-blank-spaces-in-specific-column-in-pipe-delimited-file-in-aix/46600539#comment80184535_46600539 I think with only slight adjustment, the solution will fit your needs. It is essentially the same as Lorinczy's above. — pedz, Oct 19 '17 at 02:27

score 0 · Answer 1 · answered Oct 17 '17 at 18:39

You can capture the field in a match group and replace it with other data:

^(?:[^|]*\|){32}([^|]+)\|
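
One way this might look in Java is sketched below: the capture group locates field 33, and `Matcher.start(1)` / `Matcher.end(1)` are used to splice the line. The class name `GroupSplice` and the `maxLen` parameter are hypothetical. Note that this pattern only matches when the field is non-empty and followed by a pipe; any other line is returned unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the pattern from this answer.
final class GroupSplice {

    // Group 1 is the text between the 32nd and 33rd pipe.
    private static final Pattern FIELD_33 =
            Pattern.compile("^(?:[^|]*\\|){32}([^|]+)\\|");

    static String truncate(String line, int maxLen) {
        Matcher m = FIELD_33.matcher(line);
        // Leave the line alone if there is no such field or it is already short enough.
        if (!m.find() || m.end(1) - m.start(1) <= maxLen) {
            return line;
        }
        // Keep everything up to maxLen characters into the field, then skip to the field's end.
        return line.substring(0, m.start(1) + maxLen) + line.substring(m.end(1));
    }
}
```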

rabhis

score 0 · Answer 2 · answered Oct 17 '17 at 20:34

See regex in use here

Regex

^((?:[^|]*\|){32})(([^|]{0,10})[^|]*)(?=\|)

Replace

\1\3
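
In Java, group references in the replacement string are written `$1$3` rather than `\1\3`. A minimal sketch of applying the regex above to a single line follows; the class name `RegexTruncate` is an assumption for illustration, and the sample line is the one from the question.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from this answer.
final class RegexTruncate {

    // Group 1: the first 32 fields with their pipes.
    // Group 2: all of field 33. Group 3: its first (up to) 10 characters.
    private static final Pattern FIELD_33 =
            Pattern.compile("^((?:[^|]*\\|){32})(([^|]{0,10})[^|]*)(?=\\|)");

    static String truncate(String line) {
        // Replaces "group 1 + group 2" with "group 1 + group 3"; the lookahead
        // leaves the 33rd pipe and the rest of the line in place.
        return FIELD_33.matcher(line).replaceFirst("$1$3");
    }

    public static void main(String[] args) {
        String sample = "|12210|IA||15||i956-743||||||l.4073||||a5015b3ed||l.464939|IC|||"
                + "06 06:18:17||wireered||ENTITY|wirvered|2||||"
                + "NoPodfoundorpoddoesnothaveedgetob-rd=l.415.63Z|REY||||RY|REY||";
        System.out.println(truncate(sample));
    }
}
```

Lines with fewer than 33 pipes do not match and come back unchanged.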

ctwheels
