The behavior you are observing in the LangChain recursive text splitter follows from the settings you have provided. Let's break down the code and understand the output.
First, you define a RecursiveCharacterTextSplitter object with a chunk_size of 10 and chunk_overlap of 0. The chunk_size parameter determines the maximum size of each chunk, while the chunk_overlap parameter specifies the number of characters that should overlap between consecutive chunks. In your case, the chunks will not overlap.
Next, you define a test string test with a length of 13 characters. The string contains newline characters ("\n") at specific positions.
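As a quick sanity check on the string itself, the short snippet below prints its length, the positions of the newline characters, and the pieces between them. It uses only plain Python; the variable names newline_positions and pieces are mine, not from your code.

```python
# The same 13-character string as in the question.
test = "a\nbcefg\nhij\nk"

print(len(test))  # 13

# Indices at which a newline character occurs.
newline_positions = [i for i, ch in enumerate(test) if ch == "\n"]
print(newline_positions)  # [1, 7, 11]

# The pieces between the newlines, with their lengths.
pieces = test.split("\n")
print([(p, len(p)) for p in pieces])  # [('a', 1), ('bcefg', 5), ('hij', 3), ('k', 1)]
```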
When you call r_splitter.split_text(test), the text splitter processes the input text according to the given parameters. Since the chunk_size is set to 10 and there is no overlap between chunks, it tries to produce chunks of at most 10 characters each.
The splitting process takes the separators into account. In your code the separators parameter is commented out (# separators=["\n"]), so the splitter falls back to its default list, ["\n\n", "\n", " ", ""]. The newline character is therefore still used as a separator: "\n\n" does not occur in your string, so the splitter moves on to "\n".
Splitting on "\n" yields four pieces: 'a', 'bcefg', 'hij' and 'k'. The splitter does not return these pieces as they are. It aims for chunks that are as large as possible without exceeding chunk_size, so it merges adjacent pieces again as long as the result still fits within 10 characters.
'a' and 'bcefg', together with the newline between them, take 7 characters, which fits. Adding 'hij' and its newline would bring the chunk to 11 characters, which exceeds 10, so the splitter closes the current chunk at the newline at index 7 and starts a new one. 'hij' and 'k' then merge into the second chunk.
Therefore, the output you see is ['a\nbcefg', 'hij\nk'], where the first chunk is 'a\nbcefg' (7 characters) and the second chunk is 'hij\nk' (5 characters).
Uncommenting the separators parameter does not change this on its own, because "\n" is already among the defaults and the merging step still applies. For reference, this is your code with the separator passed explicitly:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Same settings as before, but with "\n" as the only separator.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
    separators=["\n"]
)
test = """a\nbcefg\nhij\nk"""
print(len(test))  # 13
tmp = r_splitter.split_text(test)
print(tmp)
With this modification you should still get ['a\nbcefg', 'hij\nk'], since the pieces are merged up to chunk_size exactly as before. If what you want is one chunk per line, ['a', 'bcefg', 'hij', 'k'], the size-based merging is working against you; a plain test.split("\n") gives that result without the text splitter.
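To see how a size limit interacts with newline separators, here is a simplified split-then-merge sketch in plain Python. It is an illustration under my own assumptions, not LangChain's actual implementation: the function name split_then_merge is hypothetical, and the sketch ignores chunk overlap, recursion into further separators, and whitespace stripping.

```python
def split_then_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily re-join neighbours up to `chunk_size`.

    Simplified illustration only; a single piece longer than `chunk_size`
    is kept as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []  # pieces collected for the chunk being built
    for piece in text.split(separator):
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the chunk.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


test = "a\nbcefg\nhij\nk"

# Pieces are merged again as long as they fit into 10 characters.
print(split_then_merge(test, "\n", 10))  # ['a\nbcefg', 'hij\nk']

# Plain splitting, with no size-based merging at all.
print(test.split("\n"))  # ['a', 'bcefg', 'hij', 'k']
```

With a limit of 10 the sketch yields the same two chunks as in the question, because 'a\nbcefg' plus '\nhij' would be 11 characters long. The last line shows that a plain str.split is enough when one chunk per line is all you need.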