3

I am going through text that looks like the below:

I'm goin|going to be here because I hafta|have to

For each word containing a pipe, I would like to strip the substring before the pipe and keep only the substring after it. Thus, the above string should become:

I'm going to be here because I have to

I know I can do this with a Python loop, like the one below, but I need the speed of a regular expression for this:

s = "I'm goin|going to be here because I hafta|have to"
for word in s.split():
    # Keep only the part after the pipe, if the word has one
    if '|' in word:
        word = word.split('|')[1]
    print(word)

I would like to use something like re.sub to process this line, though.

Adam_G
  • 5
    One note: a regular expression will be significantly slower than the for loop (if speed is your primary concern). – Mike Pelley May 08 '17 at 23:11
  • 2
    Really? I had no idea! I'm still curious what the regex will be, but that is good to know. – Adam_G May 08 '17 at 23:12
  • 1
    `hafta` means `have to` – Peter Wood May 08 '17 at 23:27
  • 1
    @MikePelley - Actually that's not the case. A replace all `re.sub` when run is a function that runs low level code. It doesn't use language level constructs like a loop or other function calls. It is the fastest way to do it. –  May 08 '17 at 23:34
  • 2
    @sln For loops and similar constructs are compiled quite efficiently by Python, whereas regular expressions are compiled to state machines. Traversing the state machine will rarely be as fast as purpose-built code. For this example, I put together a small test case to illustrate, available [here](https://pastebin.com/svhDjCks). For both the short and long string, the for loop was approximately twice as fast. – Mike Pelley May 09 '17 at 03:12
  • 1
    @MikePelley - Available [here](http://www.tutorialspoint.com/execute_python_online.php?PID=0Bw_CjBb95KQMS0wzNndoVTNNUXM); it looks more like they time the same. The point is that the loop is not the issue; it's what is being done in the loop, i.e. function calls. In this case you're splitting on whitespace, then on the bar |. _But that's not what the regex does_. The regex finds [a-zA-Z0-9_] characters followed by a pipe, followed by a word boundary. What would it take to duplicate that without regex? –  May 09 '17 at 16:33
  • 1
    @sln You should not depend on an online site for accurate timing. Two real-world examples: Core2 Quad running Ubuntu/Python 3.5.1: 4.1s vs 8.1s for string1, 95.1s vs 155.4s for string2. Core i5 running Windows 10/Python 3.6.1: 1.8s vs 5.0s for string1, 56.0s vs 119.0s for string2. I'd be interested in the comparison you mentioned, but it's not really the point. The original poster, whose primary concern was speed, had a straightforward, easy to understand solution to his problem, and the top regex answer is significantly slower and more difficult to understand (for some developers). – Mike Pelley May 10 '17 at 19:33
  • @MikePelley - I don't really use online sites, but I don't have python. I think it's time for any coder to _learn_ how to use regular expressions, don't you? The 'regex is too hard' excuse is used too often. As for `\b\w+\|`, it's beginner level. The performance curve to go from regex to a non-regex language is astronomically large given `WordWrap[N]=(?:(?:(?>(.{1,N})(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|[^\S\r\n]))|(.{1,N}))(?:\r?\n)?|(?:\r?\n))`. If you'd like to simulate that kind of thing, that's your thing. But don't try to push an anti-regex agenda ... –  May 10 '17 at 20:22
  • 1
    @sln I'm not anti-regex, and in fact have worked for a regex-related company in the past. Here's the bottom line - Adam_G posted a question about a specific for loop, and asked for an equivalent re.sub expression to increase speed. I indicated that the regex will be slower. You said I was wrong. – Mike Pelley May 10 '17 at 20:34
  • @MikePelley - What is a regex-related company? The point is that there is a conflict between the accepted regex _re.sub()_ answer and the approach of splitting on spaces and then taking the second element of a bar split. The two methods give different results if more than one bar is in a space-split token. You could combine them and do a regex split to get faster results than a space split [see here](http://www.tutorialspoint.com/execute_python_online.php?PID=0Bw_CjBb95KQMMTREMGFGbnp1UkU), but the OP, like most others, only sees what they want to. –  May 10 '17 at 21:51
  • @sln I'm afraid you've run into the limitations of that online site again. On the Windows 10 machine mentioned previously, here are the results: for_sub(string1) 1.8s, rx_split(string1) 4.2s, for_sub(string2) 56.0s, rx_split(string2) 102.6s. As for the company, I was the Director of Software for Solidum Systems, which created the first regular expression engine in silicon. – Mike Pelley May 10 '17 at 22:20
  • @MikePelley - What exactly are the limitations of that online site again? Do they use different cores for different parts of the script, or what exactly are you saying? And congratulations on your new position. –  May 10 '17 at 22:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/143922/discussion-between-mike-pelley-and-sln). – Mike Pelley May 10 '17 at 22:29
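
For anyone who wants to check the timing claims in the comments locally instead of on an online runner, here is a minimal sketch using `timeit`. The names `for_sub`, `rx_sub` and `benchmark` are only illustrative (they are not the code from the pastebin link), the regex is the `\w+\|\b` pattern from the answers below, and the numbers will vary with machine and Python version. It also shows the behavioural difference mentioned above when a token contains more than one bar.

```python
import re
import timeit

SAMPLE = "I'm goin|going to be here because I hafta|have to"

# Compiled once so that compilation is not part of the timed call.
PIPE_PREFIX = re.compile(r"\w+\|\b")


def for_sub(text):
    """Split on whitespace; keep the part after the first '|' of each token."""
    return " ".join(
        word.split("|")[1] if "|" in word else word
        for word in text.split()
    )


def rx_sub(text):
    """Remove every run of word characters that ends in '|'."""
    return PIPE_PREFIX.sub("", text)


def benchmark(number=10_000):
    """Print the wall-clock time of each approach on SAMPLE."""
    for func in (for_sub, rx_sub):
        seconds = timeit.timeit(lambda: func(SAMPLE), number=number)
        print(f"{func.__name__}: {seconds:.4f}s for {number} calls")


if __name__ == "__main__":
    benchmark()
    # The two approaches disagree when a token has more than one bar:
    print(for_sub("a|b|c"))  # second element of the split
    print(rx_sub("a|b|c"))   # everything up to the last bar is removed
```

Note that `for_sub` rejoins tokens with single spaces, so it also normalises whitespace, which the regex version does not.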

3 Answers

2

Something like this will work:

Code:

import re
RE_FRONT_HALF = re.compile(r'\w+\|')

sample = "I'm goin|going to be here because I hafta|have to"
print(RE_FRONT_HALF.sub('', sample))

How?

Find one or more word characters followed by a pipe |.

Results:

I'm going to be here because I have to
Stephen Rauch
2

You may use a regex that will match 1+ word chars followed by a | symbol:

import re
s = "I'm goin|going to be here because I hafta|have to"
s = re.sub(r'\w+\|\b', '', s)
print(s)
# => I'm going to be here because I have to

Since the | symbol is always followed by a word char, it is advisable to use \b (word boundary) after it. This way, you will avoid matching one| followed by a space or punctuation (if you prefer to keep those).

Pattern details:

  • \w+ - 1 or more (due to the + quantifier) word chars (letters, digits, _)
  • \| - a literal | symbol (if not escaped, denotes an alternation operator)
  • \b - a word boundary.
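
To make the effect of the trailing `\b` concrete, here is a small sketch comparing the pattern with and without it. The input string and the variable names are made up for illustration; `one| two` is the case where the pipe is followed by a space rather than a word character.

```python
import re

text = "one| two goin|going"

# Without the boundary, every "word|" prefix is removed,
# including the one that is followed by a space.
without_boundary = re.sub(r"\w+\|", "", text)

# With the boundary, "word|" is removed only when a word
# character comes right after the pipe.
with_boundary = re.sub(r"\w+\|\b", "", text)

print(repr(without_boundary))  # ' two going'
print(repr(with_boundary))     # 'one| two going'
```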
Wiktor Stribiżew
1

Note that \w will also match 0-9 digits. If you don't want to match numbers in the word you can use:

import re

s = "I'm goin|going to be here because I hafta|have to"

s = re.sub(r"[a-zA-Z]*\|", "", s)

print(s)
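
As a rough illustration of how a letters-only class differs from `\w`, the sketch below runs both on a made-up input containing a digit and an underscore (the input and the constant names are assumptions for the example). It also shows one thing to watch for when writing the class by hand: the range `A-z` (lowercase `z`) is wider than `A-Z` and additionally matches `[`, `\`, `]`, `^`, `_` and the backtick.

```python
import re

WORD_CHARS = re.compile(r"\w+\|")           # letters, digits and underscore
LETTERS_ONLY = re.compile(r"[a-zA-Z]*\|")   # ASCII letters only
TOO_WIDE = re.compile(r"[a-zA-z]*\|")       # A-z also covers [ \ ] ^ _ `

text = "l8r|later foo_bar|baz"

word_result = WORD_CHARS.sub("", text)
letters_result = LETTERS_ONLY.sub("", text)
wide_result = TOO_WIDE.sub("", text)

print(word_result)     # later baz
print(letters_result)  # l8later foo_baz
print(wide_result)     # l8later baz
```

With the letters-only class, only the letters directly before the pipe are removed, so anything in front of a digit or underscore is left in place.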
Nick Weseman