Matching numbers after the nth occurrence of a certain symbol in a line

Question

I'm not sure whether regex is the right tool here, but I wanted to try solving this with regex first (if that's possible).

I have an EDIFACT file in which the data (in bold) in certain fields of some segments needs to be substituted (with different dates in the same format):

UNA:+,? '  
UNB+UNOC:3+000000000+000000000+20190801:1115+00001+DDMP190001'  
UNH+00001+BRKE:01+00+0'    
INV+ED Format 1+Brustkrebs+19880117+E000000001+**20080702**+++1+0'       
FAL+087897044+0000000++name+000000000+0+**20080702**++1+++J+N+N+N+N+N+++0'   
INL+181095200+385762115+++0'   
BEE+20080702++++0'   
BAA+++J+J++++++J+++++++J++0'   
BBA++++++++J++++++J+J++++++J+++++J+++J+J++++++++J+0'   
BHP+J+++++J+++++J+++++0'   
BLA+++J+++++++++0'   
BFA++++++++++++J++0'   
BSA++J+++J+J+++0'    
BAT+20190801+0'    
DAT+**20080702**++++0'   
UNT+000014+00001'   
UNZ+00001+00001'

At first I was able to match those fields using a positive lookahead and a lookbehind (I had a different expression for matching each date).

Here, for example, is the expression I initially used to match the date in the "FAL" segment: `(?<=\+[\d]{1}\+)\d{8}(?=\+\+)`. Then I saw that this date is sometimes preceded by 9 digits and sometimes by 1 (depending on the version), and followed by either `++` or a `+` and a date, so I added a logical OR like this: `(?<=\+[\d]{9}\+|\+[\d]{1}\+)\d{8}(?=\+[\d]{8}\+|\+\+)`. I quickly realized this is not sustainable, because these EDIFACT files vary far beyond just 9 or 1 digits.

(I have 6 versions for each type, and I have 6 types in total.)

Because I have a scheme/map indicating how each version should be built, and I know at which position (based on the + separator) the date is written in each version, I thought about matching the date based on the +: after the 7th occurrence of a plus in a certain line (say, in the FAL segment), match the next 8 digits.

Is this possible to achieve with regex? And if yes, could someone please tell me how?

Try searching for `^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)` and replace with `${1}20200101` - what is the regex engine? — Wiktor Stribiżew, Dec 13 '19 at 09:30
That kind of worked! I had no idea what "regex engine" meant, but I googled it and I guess it's Traditional NFA (https://regex101.com/r/oSVlS8/2). When I change the flavour to Python there, it doesn't work though. Any ideas what I can do? Thanks! — skdadle, Dec 13 '19 at 11:53
So are you going to use it in Python? See [this regex demo](https://regex101.com/r/oSVlS8/3), Python `re` uses a different unambiguous backreference syntax. — Wiktor Stribiżew, Dec 13 '19 at 11:54
There is not always `FAL` in this line, right? So you can't use [`(?m)^(FAL.*)\b(\d{8})\b`](https://regex101.com/r/oX3xiq/1) — bobble bubble, Dec 13 '19 at 13:02
Actually there is. The only problem is that I might have other fields in this line that are also 8 digits long. Does this match the first 8 digits encountered after "FAL"? — skdadle, Dec 16 '19 at 08:27

score 2 · Accepted Answer · answered Dec 13 '19 at 12:04

I suggest using a pattern like

^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)

where {7} can be adjusted to the value you need for each type of segment, and replace with a backreference to Group 1 followed by the new date. In Python, it is \g<1>20200101 (where 20200101 is your new date); in PHP/.NET, it is ${1}20200101. In JS, it will be just $1.

To run on a multiline text, use the m flag. In Python regex, you may embed it like (?m)^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+).
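As a small illustration of why the m flag matters, here is a sketch in Python using a two-line excerpt of the sample from the question. The variable names are only illustrative. Without the flag, `^` matches only at the very start of the text, so a FAL segment on a later line is not found.

```python
import re

# Two-line excerpt of the sample; the FAL segment is on the second line.
text = (
    "UNH+00001+BRKE:01+00+0'\n"
    "FAL+087897044+0000000++name+000000000+0+20080702++1+++J+N+N+N+N+N+++0'\n"
)

pattern = r"^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)"

# Without the m flag, ^ only matches at the start of the whole string.
no_flag = re.search(pattern, text)

# With re.MULTILINE (or the inline (?m) form), ^ matches at every line start.
with_flag = re.search(pattern, text, re.MULTILINE)
inline_flag = re.search("(?m)" + pattern, text)

print(no_flag)                        # None
print(with_flag.group(0)[-8:])        # 20080702
print(inline_flag.group(0)[-8:])      # 20080702
```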

See the Python regex demo

Details

^ - start of string/line
((?:[^+\n]*\+){7}) - Group 1: 7 repetitions of any chars other than + and newline, and then a +
\d{8} - 8 digits
(?=\+(?:\d{8})?\+) - a positive lookahead requiring that the 8 digits are followed by a +, an optional chunk of 8 digits, and another +.
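Put together, the replacement could be sketched in Python roughly as follows. The helper name `replace_nth_field_date` and its parameters are made up for this illustration, and the sample lines are taken from the question.

```python
import re

def replace_nth_field_date(text, plus_count, new_date):
    """Replace the 8-digit field that follows the plus_count-th '+' on a line.

    Hypothetical helper: plus_count plays the role of {7} in the pattern,
    and new_date is expected to be an 8-digit string such as "20200101".
    """
    pattern = re.compile(
        r"^((?:[^+\n]*\+){%d})\d{8}(?=\+(?:\d{8})?\+)" % plus_count,
        re.MULTILINE,
    )
    # \g<1> is the unambiguous Python backreference to Group 1;
    # a plain \1 followed by digits would be read as a different group number.
    return pattern.sub(r"\g<1>" + new_date, text)

sample = (
    "FAL+087897044+0000000++name+000000000+0+20080702++1+++J+N+N+N+N+N+++0'\n"
    "BAT+20190801+0'\n"
)

result = replace_nth_field_date(sample, 7, "20200101")
print(result)
```

Only the FAL line has 8 digits after its 7th `+`, so the BAT line is left untouched.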

Wiktor Stribiżew

Thank you so much! Because the expression does exactly what I asked for in my question, I'm marking it as the answer. I now have another question though: when I implement that expression in my code, I get "None" as output. Any ideas? `f=open("test edifakt 1 bk v1.txt", "r") if f.mode == 'r': readfile = f.read() Fal = re.search(r"^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)" , readfile,) print("Fal date: ", Fal)` – skdadle Dec 13 '19 at 12:23
@skdadle Do you want to extract the Fal dates? You need https://pastebin.com/t2eSkZ64, `with open("test edifakt 1 bk v1.txt", "r") as fr: fal_dates = re.findall(r'(?m)^(?:[^+\n]*\+){7}(\d{8})\+(?:\d{8})?\+', fr.read())` – Wiktor Stribiżew Dec 13 '19 at 12:33
Is it possible to have a regex that looks for the date after the nth "+", but only in lines starting with "FAL" or "DAT" or "INV" (also in the non-capturing group)? When I change the quantifier to 1 to try to get the date from the DAT segment, it also matches other ones, like the date in "BAT+20190801+0", and I purely want the date in the DAT line. – skdadle Dec 16 '19 at 09:52
@skdadle See [this regex demo](https://regex101.com/r/oSVlS8/5). – Wiktor Stribiżew Dec 16 '19 at 09:57
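The linked demo is not reproduced here, but one possible way to restrict the match to specific segments is to anchor the pattern on the segment tag. The sketch below assumes a hypothetical `DATE_POSITIONS` map (segment tag to the number of `+` signs before the date); the values shown fit the sample in the question and would differ for other versions.

```python
import re

# Hypothetical map: segment tag -> number of '+' signs before the date field.
# These values match the sample file in the question only.
DATE_POSITIONS = {"INV": 5, "FAL": 7, "DAT": 1}

def build_pattern(tag, plus_count):
    # The '+' right after the tag is the first one, so plus_count - 1
    # further "field plus separator" chunks follow before the date.
    return re.compile(
        r"^(%s\+(?:[^+\n]*\+){%d})\d{8}(?=\+(?:\d{8})?\+)"
        % (re.escape(tag), plus_count - 1),
        re.MULTILINE,
    )

def replace_dates(text, new_date, positions=DATE_POSITIONS):
    # Apply one tag-anchored substitution per configured segment.
    for tag, plus_count in positions.items():
        text = build_pattern(tag, plus_count).sub(r"\g<1>" + new_date, text)
    return text

sample = (
    "INV+ED Format 1+Brustkrebs+19880117+E000000001+20080702+++1+0'\n"
    "FAL+087897044+0000000++name+000000000+0+20080702++1+++J+N+N+N+N+N+++0'\n"
    "BEE+20080702++++0'\n"
    "BAT+20190801+0'\n"
    "DAT+20080702++++0'\n"
)

updated = replace_dates(sample, "20200101")
print(updated)
```

Because each pattern starts with the literal tag, lines such as BEE or BAT are not touched even when they carry an 8-digit field at the same position.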
