There are a few distinct problems here.
1. read vs readlines
data = text.readlines()
This produces a list of str, good.
... str(data) ...
If you print this, you will see it contains several characters you likely did not want: "[", "'", ",", and "]".
You'd be better off with just data = text.read().
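Here is a small sketch of the difference. It assumes an io.StringIO object standing in for your open file, and the sample text is made up for the demo.

```python
import io

# io.StringIO stands in for an open text file (an assumption for this demo).
text = io.StringIO("Hello there. How are you?\nFine.\n")

lines = text.readlines()   # list of str, one element per line
as_repr = str(lines)       # the list's repr: brackets, quotes, commas included

text.seek(0)               # rewind so the same "file" can be read again
whole = text.read()        # a single str holding the entire contents

print(as_repr)      # ['Hello there. How are you?\n', 'Fine.\n']
print(repr(whole))  # 'Hello there. How are you?\nFine.\n'
```

The first result carries the list punctuation along as ordinary characters, which is what later confuses the splitting step.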
2. split on str vs regex
str(data).split('([.|?])')
We are splitting on a string, ok.
Let's consult the fine documents.
Return a list of the words in the string, using sep as the delimiter string.
Notice there's no mention of a regular expression.
That argument does not appear as a literal sequence of seven characters in the source string, so nothing gets split.
You were looking for a similar function:
https://docs.python.org/3/library/re.html#re.split
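To see the two functions side by side, here is a minimal sketch that feeds the same argument to each. The sample sentence is an assumption, not your data.

```python
import re

data = "Is it late? Yes. Go home."  # made-up sample text

# str.split treats its argument as literal text. Those seven characters
# never occur in data, so we get back a one-element list.
literal = data.split('([.|?])')

# re.split treats the same argument as a regular expression.
# The capturing group keeps each separator in the result.
by_regex = re.split(r"([.|?])", data)

print(literal)   # ['Is it late? Yes. Go home.']
print(by_regex)  # ['Is it late', '?', ' Yes', '.', ' Go home', '.', '']
```

Note the empty string at the end of the re.split result: it is whatever follows the final separator.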
3. char class vs alternation
We can certainly use | vertical bar for alternation, e.g. r"(cat|dog)".
It works for shorter strings, too, such as r"(c|d)".
But for single characters, a character class is more convenient: r"[cd]".
Inside a character class, | is a literal vertical bar rather than alternation, so r"[c|d]" (or equivalently r"[cd|]") matches any one of three characters: c, d, or |.
A character class can even have just a single character, so r"[c]" is identical to r"c".
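A quick way to convince yourself is to run each pattern against a string that actually contains a vertical bar. This is just an illustrative sketch; the sample string is made up.

```python
import re

sample = "c|d"  # made-up input containing a literal vertical bar

alternation = re.findall(r"(c|d)", sample)   # | means "or" here
char_class = re.findall(r"[cd]", sample)     # same matches, simpler pattern
with_bar = re.findall(r"[c|d]", sample)      # | is a literal inside [ ]
reordered = re.findall(r"[cd|]", sample)     # order inside [ ] is irrelevant

print(alternation)  # ['c', 'd']
print(char_class)   # ['c', 'd']
print(with_bar)     # ['c', '|', 'd']
print(reordered)    # ['c', '|', 'd']

# A one-character class behaves like the bare character.
single = bool(re.fullmatch(r"[c]", "c")) and bool(re.fullmatch(r"c", "c"))
```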
4. escaping
Since an unescaped . matches any character (r".*" matches an entire line),
there are certainly cases where escaping the dot is important,
e.g. r"(cat|dog|\.)".
We can construct a character class with escaping: r"[cd\.]".
Within [ ] square brackets that \ backwhack is optional.
Better to simply say r"[cd.]", which means the same thing.
pattern = re.compile(r"[.?]")
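As a sanity check, this sketch compares escaped and unescaped dots both outside and inside a character class. The sample strings are assumptions chosen for the demo.

```python
import re

# Outside a character class the backslash matters.
any_char = re.findall(r".", "a.b")      # unescaped dot matches every character
literal_dot = re.findall(r"\.", "a.b")  # escaped dot matches only "."

# Inside a character class the dot is already literal, escaped or not.
escaped = re.findall(r"[cd\.]", "cat.dog")
unescaped = re.findall(r"[cd.]", "cat.dog")

print(any_char)     # ['a', '.', 'b']
print(literal_dot)  # ['.']
print(escaped)      # ['c', '.', 'd']
print(unescaped)    # ['c', '.', 'd']
```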
5. findall vs split
The two functions are fairly similar.
But findall() is about retrieving matching elements,
which your "preserve the final punctuation"
requirement asks for,
while split() pretty much assumes
that the separator is uninteresting.
So findall() seems a better match for your use case.
pattern = re.compile(r"[^.?]+[.?]")
Note that ^ caret usually means "anchor
to start of string", but as the first character within a character class
it is negation.
So e.g. r"[^0-9]"
means "non-digit".
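Here is a rough comparison of the two approaches on the same input, plus the negated class in isolation. The sample sentence is made up.

```python
import re

data = "Is it late? Yes. Go home."  # made-up sample text

# findall returns each match: a run of non-terminators plus one terminator.
sentences = re.findall(r"[^.?]+[.?]", data)

# split (without a capturing group) discards the separators.
pieces = re.split(r"[.?]", data)

# A leading ^ inside [ ] negates the class.
non_digits = re.findall(r"[^0-9]", "a1b2")

print(sentences)   # ['Is it late?', ' Yes.', ' Go home.']
print(pieces)      # ['Is it late', ' Yes', ' Go home', '']
print(non_digits)  # ['a', 'b']
```

The leading spaces on the later sentences come from the source text; calling .strip() on each element would remove them.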
Your original code was:
data = text.readlines()
split = str(data).split('([.|?])')
Putting it all together, try this instead:
data = text.read()
pattern = re.compile(r"[^.?]+[.?]")
sentences = pattern.findall(data)
If there's no trailing punctuation in the source string,
the final words won't appear in the result.
Consider tacking on a "."
period in that case.
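One possible way to handle that case is sketched below. The helper name split_sentences is hypothetical, and stripping whitespace from each sentence is my own choice rather than something your requirements stated.

```python
import re

pattern = re.compile(r"[^.?]+[.?]")

def split_sentences(data):
    """Hypothetical helper: split text into sentences, keeping punctuation."""
    data = data.rstrip()
    # Tack on a period when the text does not already end with one
    # of the terminators, so the final words are not dropped.
    if data and data[-1] not in ".?":
        data += "."
    # Stripping each sentence is an assumption about the desired output.
    return [s.strip() for s in pattern.findall(data)]

print(pattern.findall("One. Two? Three"))   # [' Two?'] follows 'One.'; 'Three' is lost
print(split_sentences("One. Two? Three"))   # ['One.', 'Two?', 'Three.']
```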