Regex to append some characters in a certain position

Question

I have a txt file which looks like this:

abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;

I want to modify it using regular expressions because the file is very long. On each line, I want to insert the first word of the line after the first ";" and before CAT(...). The inserted word should be followed by a second ";", so that it sits between two semicolons. How can I do it?

So my output should be:

abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;

Tim Biegeleisen · Accepted Answer · 2019-11-04T14:39:00.440

You may try the following find and replace, in regex mode:

Find:    ^([^(]+)(.*?;)(CAT.*)$
Replace: $1$2$1;$3

The idea here is to split each line into the pieces we need and then stitch them back together in the replacement. The first capture group is the word we want to insert after the first semicolon, before CAT.
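To make the three pieces visible, here is a minimal Python sketch that applies the same pattern to the sample line from the question and prints what each group captures. The names `PATTERN`, `groups` and `rewritten` are only illustrative; Python writes backreferences as `\1` where the Replace field uses `$1`.

```python
import re

# Same pattern as the Find field above. re.MULTILINE lets ^ and $ match at
# every line boundary if the text contains several lines.
PATTERN = re.compile(r'^([^(]+)(.*?;)(CAT.*)$', flags=re.MULTILINE)

line = 'abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;'

# Group 1: the first word; group 2: up to and including the first ";"
# that is followed by CAT; group 3: CAT and the rest of the line.
groups = PATTERN.match(line).groups()
print(groups)

# Reassemble as: word, middle part, word again plus ";", then the CAT part.
rewritten = PATTERN.sub(r'\1\2\1;\3', line)
print(rewritten)
```

The second `print` should show the line from the question with `abandon;` inserted before `CAT`.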

Just noticed you are using Python. We can try:

import re

inp = """aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;
abadan(iof>city>thing);CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"""
output = re.sub(r'([^(]+)(.*?;)(CAT.*?;)\s*', '\\1\\2\\1;\\3\n', inp)
print(output)

This prints:

aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;

@GDJS Sorry, I missed your Python tag. I added a short script which seems to work. — Tim Biegeleisen, Nov 04 '19 at 14:39

Alexander Rossa · Answer 2 · 2019-11-04T14:40:19.397

In Python you can do this as follows:

import re

test_strings = [
    'aarhus(iof>city>thing,equ>arhus);CAT(CATN),N(NP) ;',
    'abadan(iof>city>thing);CAT(CATN),N(NP) ;',
    'abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;' 
]
# The first group matches the word that you want to repeat, the second captures
# everything up to ;CAT, and the third captures ;CAT and the rest of the line
regex = r'(\w+)(.*)(;CAT.*)'

new_strings = []
for test_string in test_strings:
    match = re.match(regex, test_string)
    new_string = match.group(1) + match.group(2) + ";" + match.group(1) + match.group(3)
    new_strings.append(new_string)
    print(new_string)

Gives you:

aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;

And your strings are stored in the new_strings list.

EDIT: To read your file as a list of strings ready to be modified, open it in a with statement and call readlines(). Note that each string keeps its trailing newline:

my_file = 'my_text_file.txt'

with open(my_file, 'r') as f:
    my_file_as_list = f.readlines()
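As a rough sketch of how the file reading and the regex from this answer might fit together: the helper name `rewrite_line` and the temporary demo file are assumptions made for illustration, not part of the original answer. The sketch iterates over the file object instead of calling `readlines()`, which avoids holding the whole input in memory, and it leaves lines that do not match unchanged.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the answer above.
regex = re.compile(r'(\w+)(.*)(;CAT.*)')

def rewrite_line(line):
    """Repeat the first word before ;CAT, or return the line unchanged if it does not match."""
    match = regex.match(line)
    if match is None:
        return line
    return match.group(1) + match.group(2) + ";" + match.group(1) + match.group(3)

# Hypothetical demo file so the sketch runs on its own; use your real path instead.
with tempfile.TemporaryDirectory() as tmp:
    my_file = Path(tmp) / 'my_text_file.txt'
    my_file.write_text('abadan(iof>city>thing);CAT(CATN),N(NP) ;\n# a comment line\n')

    with open(my_file, 'r') as f:
        # Strip the trailing newline that each line read from a file carries.
        new_strings = [rewrite_line(line.rstrip('\n')) for line in f]

print(new_strings)
```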

I can't import my text as a unique String. It is a huge txt file (for the questions I only extracted the first three lines). — , Nov 04 '19 at 14:39
@GDJS That's a different problem then and you can search for how to read a file into a list of strings. Generally speaking it is quite simple, I will update my answer for general case. — Alexander Rossa, Nov 04 '19 at 14:41

RightmireM · Answer 3 · 2019-11-04T14:49:29.880

Matching the groups separately and stitching them back together may be faster than a regex replace, though I have not tested this.

import re

#=== DESIRED ===================================================================
# aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
# abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
# abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
#===============================================================================

data = ["abadan(iof>city>thing);CAT(CATN),N(NP) ;", 
"abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"]

# Matching the groups separately and then stitching them together may be faster than a regex replace.
# Based on https://stackoverflow.com/questions/3850074/regex-until-but-not-including
# (?:(?!CAT).)* - match anything until the start of the word CAT.
# I.e.
# (?:        # Match the following but do not capture it:
#  (?!CAT)   # first assert that it's not possible to match "CAT" here,
#  .         # then match any character
# )*         # end of group, zero or more repetitions.
# Raw strings keep the backslash in \( from being read as a string escape.
p = ''.join([r"^", # Match start of string
             r"(.*?(?:(?!\().)*)", # Group 1: anything up to the first open paren, i.e. the first word (abadan or abandon)
             r"(.*?(?:(?!CAT).)*)", # Group 2: everything after group 1, up to but not including "CAT"
             r"(.*$)" # Group 3: the rest of the line
             ])

for line in data:
    m = re.match(p, line)    
    newline  = m.group(1) # First word
    newline += m.group(2) # Group two
    newline += m.group(1) + ";" # First word again with semi-colon
    newline += m.group(3) # Group three

    print(newline)

OUTPUT:

abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
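The speed claim at the top of this answer is untested. One rough way to check it is a `timeit` comparison like the sketch below. The function names `by_matching` and `by_substitution` are made up for illustration, `match_pattern` is a simplified form of the pattern above, and `sub_pattern` is borrowed from the accepted answer. Timings depend on the machine and the data, so no result is claimed here.

```python
import re
import timeit

line = "abandon(icl>leave>do,agt>person,obj>person);CAT(CATV),AUX(AVOIR),VAL1(GN) ;"

# Simplified form of the three-group pattern: first word, everything up to
# (not including) "CAT", then the rest of the line.
match_pattern = re.compile(r'^([^(]*)((?:(?!CAT).)*)(.*)$')
# Find-and-replace pattern from the accepted answer.
sub_pattern = re.compile(r'^([^(]+)(.*?;)(CAT.*)$')

def by_matching(text):
    """Match the groups, then stitch the pieces together by concatenation."""
    m = match_pattern.match(text)
    return m.group(1) + m.group(2) + m.group(1) + ";" + m.group(3)

def by_substitution(text):
    """Let re.sub build the result from backreferences."""
    return sub_pattern.sub(r'\1\2\1;\3', text)

# Both approaches should agree before their speed is compared.
assert by_matching(line) == by_substitution(line)

for func in (by_matching, by_substitution):
    seconds = timeit.timeit(lambda: func(line), number=20_000)
    print(f"{func.__name__}: {seconds:.3f} s for 20,000 calls")
```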

score 0 · Answer 4 · answered Nov 04 '19 at 15:04

This script reads the input file, performs the replacement, and writes the result to the output file:

import re

infile = 'input.txt'
outfile = 'outfile.txt'
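# Open both files in one with statement so they are closed (and the output flushed) on exit
with open(infile, 'r') as f, open(outfile, 'w') as o:
    for line in f:
        # Group 1 is everything before ;CAT, group 2 is the leading word; append ";word" to group 1
        o.write(re.sub(r'((\w+).+?)(?=;CAT)', r'\1;\2', line))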

cat outfile.txt 
aarhus(iof>city>thing,equ>arhus);aarhus;CAT(CATN),N(NP) ;
abadan(iof>city>thing);abadan;CAT(CATN),N(NP) ;
abandon(icl>leave>do,agt>person,obj>person);abandon;CAT(CATV),AUX(AVOIR),VAL1(GN) ;
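The substitution in this script relies on the lookahead `(?=;CAT)`: it requires `;CAT` to follow the match but does not consume it, so the replacement only has to append `;word` to group 1. A small sketch of that behaviour on single strings follows; the helper name `add_word` and the sample inputs are illustrative assumptions.

```python
import re

# Same pattern and replacement as in the script above.
PATTERN = re.compile(r'((\w+).+?)(?=;CAT)')

def add_word(line):
    """Insert ';<first word>' just before ';CAT'. Lines without ';CAT' come back unchanged."""
    return PATTERN.sub(r'\1;\2', line)

changed = add_word('abadan(iof>city>thing);CAT(CATN),N(NP) ;\n')
unchanged = add_word('# header without the marker\n')

print(changed, end='')
print(unchanged, end='')
```

Because a line without `;CAT` passes through untouched, the loop should be safe to run over files that contain blank lines or headers.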
