
I have a 17GB pipe-delimited .txt file and need to truncate any string longer than 10 characters, located between the 32nd and 33rd pipe, to its first 10 characters in order to populate a database column. This has to happen without opening the file in Sublime Text, so it would need to be done through Java or Bash on AIX. On regex101.com I was trying to implement the ideas presented in the following post:

RegEx: Match nth occurence

but it doesn't limit the matched pattern only to my replacement-string.

Sample input:

|12210|IA||15||i956-743||||||l.4073||||a5015b3ed||l.464939|IC|||06 06:18:17||wireered||ENTITY|wirvered|2||||NoPodfoundorpoddoesnothaveedgetob-rd=l.415.63Z|REY||||RY|REY||

Intended output:

Change ...|NoPodfoundorpoddoesnot...|... to ...|NoPodfound|...

Full output string after replacement/truncation:

|12210|IA||15||i956-743||||||l.4073||||a5015b3ed||l.464939|IC|||06 06:18:17||wireered||ENTITY|wirvered|2||||NoPodfound|REY||||RY|REY||

Attempt at regex match:

^(?:[^|]*\|){32}[^|]+\| which matches everything from the start to the 33rd |, so |12210.......l.415.63Z|, but I want it to only match the string between pipes 32 and 33, specifically NoPodfoundorpoddoesnothaveedgetob-rd=l.415.63Z, for replacement purposes.

update 1; 10/18/17:

(^(?:[^|]*\|){32}[^|]{0,10})([^|]*)(\|.*$) group capture substitution with \1\3 provides the desired result. But this match must have a flaw since it seems to be capturing a non-capturing group (?:[^|]*\|).

update 2; 10/19/17:

I tried the following commands on the PuTTY command line, but they do not edit the file:

cat subStrTest.txt
awk 'BEGIN{FS=OFS="|"}{$33=substr($33,1,10)} 1' subStrTest.txt

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html suggests that

string = substr(string,startIndex,numOfCharacters)

is valid syntax, at least for gawk, but I don't know whether the assignment

$33=substr($33,1,10)

is valid for fields referenced with $, such as $33, within awk.
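
For what it's worth, assigning to a field such as $33 is standard awk, and doing so rebuilds the record using OFS. The command above prints the modified records to standard output and never writes back to subStrTest.txt, which would explain why the file looks unedited. A minimal sketch of one way around that, assuming there is enough free disk space for a second copy of the file; the function name `truncate_field` and the temporary file name are hypothetical, and the `NF >= 33` guard is an added assumption so that shorter lines are passed through unchanged:

```shell
#!/bin/sh
# truncate_field FILE: hypothetical helper that cuts field 33 of a
# pipe-delimited file to its first 10 characters, then replaces FILE.
truncate_field() {
    src=$1
    tmp="$src.tmp.$$"   # temporary copy next to the source (assumed writable)
    # awk processes one record at a time, so the whole file is never held in memory.
    # Lines with fewer than 33 fields are printed as they are.
    awk 'BEGIN { FS = OFS = "|" } NF >= 33 { $33 = substr($33, 1, 10) } 1' "$src" > "$tmp" &&
        mv "$tmp" "$src"   # only overwrite the original if awk succeeded
}
```

It could then be called as `truncate_field subStrTest.txt`, ideally on a small test copy first.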

Parth Patel
  • If you are running on unix/linux, why not use `sed`? – M. le Rutte Oct 17 '17 at 18:23
  • Try using awk. It's great for such cases. – Malt Oct 17 '17 at 18:32
  • There's no flaw, it's capturing the non-capture group because it's nested in a capture group, therefore it will be caught. Using other flavours of regex, you can do without the first capture group since the `\K` token can be used. Try `^(?:[^|]*\|){32}\K(([^|]{0,10})[^|]*)(?=\|)` on regex101. Unfortunately, Java doesn't support this token (as far as I'm aware) – ctwheels Oct 17 '17 at 20:37
  • Yes, it's a job for awk. Guess it would be something like this: `awk 'BEGIN { FS="|"; OFS="|"; } { $33= substr ($33,1,10); print; }'` – Lorinczy Zsigmond Oct 18 '17 at 06:03
  • Please review the answer to this question: https://stackoverflow.com/questions/46600250/removing-blank-spaces-in-specific-column-in-pipe-delimited-file-in-aix/46600539#comment80184535_46600539 I think with only slight adjustment, the solution will fit your needs. It is essentially the same as Lorinczy's above. – pedz Oct 19 '17 at 02:27

2 Answers


You can put the field in a capture group and replace it with other data: ^(?:[^|]*\|){32}([^|]+)\|

rabhis

See regex in use here

Regex

^((?:[^|]*\|){32})(([^|]{0,10})[^|]*)(?=\|)

Replace

\1\3
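
Since sed was also suggested in the comments, the same capture-and-drop idea can be sketched there as well. sed has no lookahead, so this variant simply leaves the 33rd pipe unmatched rather than asserting it. It is written with POSIX basic regular expressions on the assumption that the local sed supports `\{m,n\}` intervals on groups; the function name `truncate_with_sed` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads pipe-delimited records on stdin and writes them to
# stdout with the text between the 32nd and 33rd pipe cut to 10 characters.
truncate_with_sed() {
    # Group 1 captures the first 32 pipe-terminated fields plus up to 10
    # characters of the next field; the trailing [^|]* matches whatever is left
    # of that field and is dropped because it is not part of \1.
    sed 's/^\(\([^|]*|\)\{32\}[^|]\{0,10\}\)[^|]*/\1/'
}
```

Usage would look like `truncate_with_sed < subStrTest.txt > subStrTest.out`. Lines with fewer than 32 pipes do not match and pass through unchanged.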
ctwheels