I am trying to create chunks of at most 350 characters with a 100-character overlap.
I understand that chunk_size is an upper limit, so I may get chunks shorter than that. But why am I not getting any chunk_overlap?
Is it because the overlap also has to split on one of the separator characters? In other words, do I only get a 100-character chunk_overlap if there is a separator within 100 characters of the split point?
from langchain.text_splitter import RecursiveCharacterTextSplitter
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=350,
    chunk_overlap=100,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
x = r_splitter.split_text(some_text)
print(x)
for thing in x:
    print(len(thing))
Output:
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
248
243
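
One way to reason about this result is a simplified model in which the splitter first cuts the text on a separator and then packs whole pieces into chunks, carrying over only whole trailing pieces as overlap. The sketch below is my own approximation of that idea, not LangChain's actual code; `merge_pieces` and its `joiner` parameter are hypothetical names. Under that assumption, two paragraph-sized pieces of 248 and 243 characters can never overlap with chunk_overlap=100, because no whole piece fits in 100 characters.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap, joiner=" "):
    """Simplified model: greedily pack whole pieces into chunks.

    Overlap is carried over only as whole trailing pieces of the previous chunk.
    """
    chunks, window, total = [], [], 0  # total == len(joiner.join(window))
    for piece in pieces:
        sep = len(joiner) if window else 0
        if window and total + sep + len(piece) > chunk_size:
            chunks.append(joiner.join(window))
            # Drop leading pieces until what remains fits the overlap budget
            # and still leaves room for the incoming piece.
            while window and (
                total > chunk_overlap
                or total + len(joiner) + len(piece) > chunk_size
            ):
                removed = window.pop(0)
                total -= len(removed) + (len(joiner) if window else 0)
        total += len(piece) + (len(joiner) if window else 0)
        window.append(piece)
    if window:
        chunks.append(joiner.join(window))
    return chunks


# Two paragraph-sized pieces, like the ones in the question (248 and 243 chars).
big = ["a" * 248, "b" * 243]
big_chunks = merge_pieces(big, chunk_size=350, chunk_overlap=100)
print([len(c) for c in big_chunks])  # [248, 243]: no piece fits in the overlap

# Ten 60-character pieces, roughly sentence-sized.
small = [f"{i:02d}" + "x" * 58 for i in range(10)]
small_chunks = merge_pieces(small, chunk_size=350, chunk_overlap=100)
# Check whether the last piece of chunk 0 is repeated at the start of chunk 1.
print(small_chunks[0].endswith(small[4]), small_chunks[1].startswith(small[4]))
```

In this model, overlap only shows up once the pieces being packed are smaller than chunk_overlap, as in the second case. If the real splitter behaves the same way, the text above would need to be cut into sentence-sized pieces (or chunk_size reduced) before any overlap could appear.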