
I'm looking for a bit of help here. I'm a complete newbie!

I need to look in a file for a code matching the pattern A00000_00_A and append a count to it, so the first time it appears it is replaced with A00000_00_A_001, second time A00000_00_A_002 etc. The output needs to be written back to the same file. Each file only contains 1 code, but it appears multiple times.

After some digging I have found:

perl -pi -e 's/Q\d{4,5}'_'\d{2}_./$&.'_'.++$A /ge' /users/documents/*.xml

but the issue is the counter does not reset in each file.

That is, the output of the first file is, say, Q00390_01_A_1 to Q00390_01_A_7, while the second file gets Q00391_01_A_8 to Q00391_01_A_10.

What I want is Q00390_01_A_1 to Q00390_01_A_7 in the first file and Q00391_01_A_1 to Q00391_01_A_2 in the second.

Does anyone have any idea on how to edit the above code to make it do that? I'm a total newbie so ideally an edit to what I have would be brilliant. Thanks

aewan86
  • First of all, you should not try to parse XML files with anything but an XML parser. Using Python would let you implement it entirely without chaining multiple tools. For the leading zeros, `printf` (awk, Python, and the shell all have it) is what you need. For iterating over files in folders, loop over a file globbing pattern (both Python and the shell can do this). As for changing files in place, a temporary file is always the way unless the tool you use can load the entire file into memory and write it back. Also consider including an example of the XML files you are trying to modify. – Léa Gris Mar 05 '22 at 13:05
  • `sprintf("%03d",++count)` returns a 3-digit, 0-padded string; if you're using `GNU awk` then `awk -i inplace ...` can be used to overwrite the file you're reading from (though there is nothing wrong with using a temp file as an intermediate medium) – markp-fuso Mar 05 '22 at 14:12
  • Thanks! After some more research I've found – aewan86 Mar 05 '22 at 16:20
  • `cd /users/documents/; for f in *.xml; do perl -pi -e 's/'facs=.'(Q|M)\d{4,5}'_'\d{2}_\w/$&.'_'.sprintf("%04d",++$A) /ge' $f; done` – aewan86 Mar 05 '22 at 16:52
  • Do you want to append `1` or `001`? They are quite different, and also behave differently in that the latter can be sorted alphabetically. You really should have a test case to perform this code on, especially since you imply that the source is XML. While it is certainly possible to edit XML with regexes, it can actually also be very bad, as this now legendary post says: https://stackoverflow.com/a/1732454/725418 – TLP Mar 06 '22 at 10:30
  • Thanks for this link! I had no idea what an ethical minefield I was stepping into. Nathan's answer works perfectly on the files I need it to work on: they are very simple XML files and I know about all the tags that are in them – aewan86 Mar 07 '22 at 09:40
  • @aewan86 Note that your solution does not require that the *same* tags get ascending numbers, only that tags of the same format get ascending numbers. That is, you can get `Q_0000_00_001`, `Q_0123_99_002`, `Q_4440_12_003`, etc. Of course, the data set in the file might prevent this, but a better solution might use a hash to keep a separate count for each distinct match. – TLP Mar 07 '22 at 15:04

1 Answer

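cd /users/documents/
# Run perl once per file, so the counter $A starts from zero for every file.
for f in *.xml; do
  # "_" is quoted explicitly inside the Perl code; "$f" is quoted for the shell.
  perl -pi -e 's/facs=.(Q|M)\d{4,5}_\d{2}_\w/$&."_".sprintf("%04d",++$A)/ge' "$f"
done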

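This matches the string facs= and any one character (the opening quote), then "Q" or "M" followed by either four or five digits, then an underscore, then two digits, another underscore, and a word character. The entire match is then concatenated with an underscore and the value of $A zero-padded to four digits. Because the loop starts a new perl process for each file, $A begins at zero again every time, which is what resets the counter per file.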

Nathan Mills