I'm writing a program that jumbles the clauses within a text, using punctuation marks as the delimiters at which to split it.

At the moment my code has a large list where each item is a group of clauses.

import re
from random import shuffle
clause_split_content = []

text = ["this, is. a test?", "this: is; also. a test!"]

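for i in text:
    # Splitting on the punctuation discards the delimiters themselves.
    clause_split = re.split('[,;:".?!]', i)
    # Drop the last element (an empty string when the line ends with punctuation).
    clause_split.remove(clause_split[len(clause_split)-1])
    for x in range(0, len(clause_split)):
        clause_split_content.append(clause_split[x])
shuffle(clause_split_content)
print(*clause_split_content, sep='')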

At the moment the result jumbles the text but drops the punctuation marks used as delimiters. The output looks something like this:

a test this also this is a test is

I want to retain the punctuation within the final output so it would look something like this:

a test! this, also. this: is. a test? is;
  • Why split it on the punctuation? Can't you just take each index in the list and append it as a single string? – Captain Caveman May 27 '22 at 17:06
  • In my program each item in the list is a line of text within a larger text. However, within each line there is punctuation on which I need to be able to split further. – user7266757 May 27 '22 at 17:14
  • I'm not certain I understand your question. Is the answer below close? – Captain Caveman May 27 '22 at 17:18
  • Does this answer your question? [In Python, how do I split a string and keep the separators?](https://stackoverflow.com/questions/2136556/in-python-how-do-i-split-a-string-and-keep-the-separators) – G. Anderson May 27 '22 at 17:20

2 Answers

I think you are simply using the wrong re function for your purpose. split() drops the separators (unless the pattern wraps them in a capturing group), but you can use another function such as findall() to select each clause together with its separator. For example, the following code creates your desired output:

import re
from random import shuffle

clause_split_content = []

text = ["this, is. a test?", "this: is; also. a test!"]

for i in text:
    words_with_separator = re.findall(r'([^,;:".?!]*[,;:".?!])\s?', i)
    clause_split_content.extend(words_with_separator)

shuffle(clause_split_content)
print(*clause_split_content, sep=' ')

Output:

this, this: is. also. a test! a test? is;

The pattern ([^,;:".?!]*[,;:".?!])\s? matches a run of characters that are not separators, followed by a single separator. Both the run and the separator sit inside the capturing group, so findall() returns them together as one item. The trailing \s? sits outside the group and only consumes the space after each separator, so the next clause does not begin with one.
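
To see what the pattern captures before any shuffling, here is a minimal sketch that runs it on one string from the example. The names SEPARATORS, pattern and clauses are illustrative only and are not part of the code above.

```python
import re

# Same character class as in the answer, factored out for readability (illustrative name).
SEPARATORS = r',;:".?!'

# One clause = any run of non-separators, then one separator; the optional
# whitespace after it sits outside the group and is therefore not returned.
pattern = re.compile(rf'([^{SEPARATORS}]*[{SEPARATORS}])\s?')

clauses = pattern.findall("this: is; also. a test!")
print(clauses)  # ['this:', 'is;', 'also.', 'a test!']
```

Note that any trailing text with no closing separator is not matched by this pattern, so it would be silently dropped from the result.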

JANO

Here's a way to do what you've asked:

import re
from random import shuffle
text = ["this, is. a test?", "this: is; also. a test!"]
content = [y for x in text for y in re.findall(r'([^,;:".?!]*[,;:".?!])', x)]
shuffle(content)
print(*content, sep=' ')

Output:

 is;  is.  also.  a test? this,  a test! this:

Explanation:

  • the regex pattern r'([^,;:".?!]*[,;:".?!])' matches 0 or more non-separator characters followed by a separator character, and findall() returns a list of all such non-overlapping matches
  • the list comprehension iterates over the input strings in list text and has an inner loop that iterates over the findall results for each input string, so that we create a single list of every matched pattern within every string.
  • shuffle and print are as in your original code.
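
Each match in this version keeps the space that followed the previous separator, which is why the output above shows doubled spaces. One possible tweak, sketched below, is to strip every match. The seed() call is an assumption added only to make the run repeatable; it is not part of the answer.

```python
import re
from random import seed, shuffle

text = ["this, is. a test?", "this: is; also. a test!"]

# Same pattern as above; .strip() removes the leading space left on each match.
content = [y.strip() for x in text for y in re.findall(r'([^,;:".?!]*[,;:".?!])', x)]

seed(0)  # assumption: fixed seed purely so repeated runs give the same order
shuffle(content)
print(*content, sep=' ')
```

Drop the seed() line to get a different order on every run, as in the original code.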
constantstranger