p1 = re.compile(r"https?:[^\s]+[a-zA-Z0-9]")

p2 = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&._%\-]+)", re.U)

I would like to consolidate these two patterns into one so that I can use the 'split' function to split text based on the unified regular expression. How can I do that? Is there some kind of pattern union operation, such as:

p = p1 + p2

p1 is a pattern that matches a URL string, and p2 is a pattern that splits text into blocks based on certain characters. I want a new pattern that matches either p1 or p2. This is in Python.

To illustrate with an example:

text = "This is a https://www.stackoverflow.com/posts/32244/edits example."

If I just apply p2, the text will be split into:

['This', ' ', 'is', ' ', 'a', ' ','https', '://', 'www.stackoverflow.com', '/', 'posts', '/', '32244', '/', 'edits', 'example']

I don't want to split the URL and I want to get these chunks:

['This', ' ', 'is', ' ', 'a', ' ', 'https://www.stackoverflow.com/posts/32244/edits', ' ', 'example', '.']

That's why I want to add p1 as a URL-preserving pattern. My description above with p = p1 + p2 may not be accurate.

marlon
  • Are you looking for `|`? As in, `pat1|pat2`, which matches one pattern or the other. – ggorlen Aug 04 '20 at 04:24
  • @ggorlen No reason why that can't be an answer... – Tim Biegeleisen Aug 04 '20 at 04:24
  • @ggorlen, yes. How to apply the '|' into the above two patterns? – marlon Aug 04 '20 at 04:25
  • @TimBiegeleisen seems too trivial to answer. Surely it's a dupe or not worth having around. Also, it wasn't entirely clear to me OP wants alternation not concatenation based on the question. @marlon substitute the two patterns for `pat1` and `pat2`. – ggorlen Aug 04 '20 at 04:27
  • @ggorlen, please see my update, which is much clearer. I don't think it's exactly the OR operation. – marlon Aug 04 '20 at 04:35
  • You are in fact looking exactly for the OR operation. `(x|y)` matches `x` if it can, but if it can't, it falls back and tries `y`. – tripleee Aug 04 '20 at 04:38
  • @tripleee, Does it look like this: re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%\-]+|https?:[^\s]+[a-zA-Z0-9])", re.U) – marlon Aug 04 '20 at 04:41
  • OK, yeah, I think you had an [x-y problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) to begin with. Thanks for recognizing/updating. – ggorlen Aug 04 '20 at 04:41
  • You want the more specific pattern first of course, and then fall back to the less specific. – tripleee Aug 04 '20 at 04:42
  • @ggorlen, please write an answer. Much appreciated! – marlon Aug 04 '20 at 04:43
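
The point made in the comments about putting the more specific pattern first can be checked directly. The following is a minimal sketch, assuming Python 3 and the sample sentence from the question; the names `url`, `token`, `url_first`, and `token_first` are illustrative and not part of the original post, and the p2 character class is used here without its capturing group.

```python
import re

text = "This is a https://www.stackoverflow.com/posts/32244/edits example."

# p1 from the question: a URL-like run ending in a letter or digit.
url = r"https?:[^\s]+[a-zA-Z0-9]"
# p2 from the question, written without its capturing group.
token = r"[\u4E00-\u9FD5a-zA-Z0-9+#&._%\-]+"

# Alternation is ordered: at each position the leftmost branch that matches wins.
url_first = re.compile(f"{url}|{token}")
token_first = re.compile(f"{token}|{url}")

print(url_first.findall(text))
# ['This', 'is', 'a', 'https://www.stackoverflow.com/posts/32244/edits', 'example.']

print(token_first.findall(text))
# ['This', 'is', 'a', 'https', 'www.stackoverflow.com', 'posts', '32244', 'edits', 'example.']
```

With the token branch first, `https` is consumed as an ordinary token before the URL branch is ever tried, so the URL is broken apart.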

1 Answer


I don't think a split operation is appropriate here--it's easier to define the matches positively by stating which subpatterns you do want rather than where they're delimited. Although the spec is left to be inferred, your groups appear to be:

  1. One or more spaces ( +).
  2. Any sequence of characters starting with \bhttp and not involving a space (\bhttp[^ ]+).
  3. Any sequence of word characters (\b\w+).
  4. Any sequence of nonword, nonspace characters (punctuation, etc) ([^\w\s]+).

Join the different possibilities in an alternation:

>>> re.findall(r" +|\bhttp[^ ]+|\b\w+|[^\w\s]+", text)
['This', ' ', 'is', ' ', 'a', ' ', 'https://www.stackoverflow.com/posts/32244/edits', ' ', 'example', '.']
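
For reference, the four groups listed above can be written out as a self-contained script. This is only a sketch under the assumptions of the example sentence: the name `TOKENS` is illustrative, the pattern uses `re.VERBOSE` so each branch can carry a comment, and the punctuation branch is written as `[^\w\s]+`, which matches only characters that are neither word characters nor whitespace.

```python
import re

text = "This is a https://www.stackoverflow.com/posts/32244/edits example."

# One branch per group; earlier branches take priority over later ones.
TOKENS = re.compile(
    r"""
      [ ]+          # 1. one or more spaces
    | \bhttp[^ ]+   # 2. a run starting with "http" up to the next space
    | \w+           # 3. a run of word characters
    | [^\w\s]+      # 4. a run of punctuation (neither word nor whitespace)
    """,
    re.VERBOSE,
)

print(TOKENS.findall(text))
# ['This', ' ', 'is', ' ', 'a', ' ',
#  'https://www.stackoverflow.com/posts/32244/edits', ' ', 'example', '.']
```

Because the pattern has no capturing groups, `findall` returns the full text of each match, so spaces and punctuation come back as their own chunks.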
ggorlen
  • I will try your pattern, but there is a reason to try to stick to the original p2 pattern, since this is for non-English text but I am using English text above to illustrate the idea. – marlon Aug 04 '20 at 04:46
  • Please ask about your actual spec. If you overtrivialize it you can expect inaccurate answers since there's no way for me to know. Can you post a small, representative snippet of the actual text and the corresponding expected output? – ggorlen Aug 04 '20 at 04:49
  • Your answer is helpful. I will figure out the rest. The actual one is complicated and I can't come up with a small but complete example. – marlon Aug 04 '20 at 04:58
  • If you can write your answer using '|', it may be easier for me to use it in my case. – marlon Aug 04 '20 at 05:00
  • I only want to split based on these delimiters defined by p2, and if it's a URL, I don't want to split anything inside of the URL. Is this clear? @ggorlen – marlon Aug 04 '20 at 05:02
  • No, not really. I can only answer based on what you wrote, not what you didn't write. I did use `|`. You're asking me to join your two regexes but I have no idea what your input or expected output are other than the contrived version that doesn't match your actual use case. So there's nothing I can do until you can minimize it. Joining your two regexes with a pipe won't help you on your provided example because the literal `.` character would be in the same group as the `a-z`, so you'll never get `["sample", "."]`. – ggorlen Aug 04 '20 at 05:03
  • The original code works with p2 only, and now I want to extend it so that it does not break a URL. – marlon Aug 04 '20 at 05:05
  • Then just join them with a pipe as in my initial comment... But use the `http` one first so it's matched with a higher priority. – ggorlen Aug 04 '20 at 05:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219159/discussion-between-marlon-and-ggorlen). – marlon Aug 04 '20 at 05:06
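
The suggestion in the comments, joining the two patterns with a pipe and putting the URL branch first, might look like the sketch below. It assumes Python 3 and the sample sentence from the question; `URL`, `TOKEN`, `combined`, and `split_keep_urls` are hypothetical names, and the p2 character class is wrapped in a single outer capturing group so that `re.split` keeps the matched pieces.

```python
import re

# p1 and p2 from the question, as plain pattern strings.
URL = r"https?:[^\s]+[a-zA-Z0-9]"
TOKEN = r"[\u4E00-\u9FD5a-zA-Z0-9+#&._%\-]+"

# URL branch first so it takes priority; one capturing group around the
# whole alternation makes re.split return the matches as well as the gaps.
combined = re.compile(f"({URL}|{TOKEN})")

def split_keep_urls(text):
    """Split like p2, but keep any URL as a single chunk."""
    # re.split yields empty strings at the edges; drop them.
    return [piece for piece in combined.split(text) if piece]

text = "This is a https://www.stackoverflow.com/posts/32244/edits example."
print(split_keep_urls(text))
# ['This', ' ', 'is', ' ', 'a', ' ',
#  'https://www.stackoverflow.com/posts/32244/edits', ' ', 'example.']
```

Note that `example.` stays in one piece: the literal `.` belongs to the p2 character class, so this pattern alone cannot yield `'example', '.'` as separate chunks, as pointed out in the comments.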