
I don't understand the following behavior of the LangChain recursive text splitter. Here are my code and output.

from langchain.text_splitter import RecursiveCharacterTextSplitter
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
#     separators=["\n"]#, "\n", " ", ""]
)
test = """a\nbcefg\nhij\nk"""
print(len(test))
tmp = r_splitter.split_text(test)
print(tmp)

Output

13
['a\nbcefg', 'hij\nk']

As you can see, it outputs chunks of size 7 and 5 and only splits on one of the newline characters. I was expecting the output to be ['a', 'bcefg', 'hij', 'k'].

GreenEye

3 Answers


According to the split_text function in RecursiveCharacterTextSplitter:

def split_text(self, text: str) -> List[str]:
    """Split incoming text and return chunks."""
    final_chunks = []
    # Get appropriate separator to use
    separator = self._separators[-1]
    for _s in self._separators:
        if _s == "":
            separator = _s
            break
        if _s in text:
            separator = _s
            break
    # Now that we have the separator, split the text
    if separator:
        splits = text.split(separator)
    else:
        splits = list(text)
    # Now go merging things, recursively splitting longer texts.
    _good_splits = []
    for s in splits:
        if self._length_function(s) < self._chunk_size:
            _good_splits.append(s)
        else:
            if _good_splits:
                merged_text = self._merge_splits(_good_splits, separator)
                final_chunks.extend(merged_text)
                _good_splits = []
            other_info = self.split_text(s)
            final_chunks.extend(other_info)
    if _good_splits:
        merged_text = self._merge_splits(_good_splits, separator)
        final_chunks.extend(merged_text)
    return final_chunks

This will merge adjacent splits whenever their cumulative length (including separators) stays under the chunk size, which is 10 in your example. That is why 'a' and 'bcefg' end up together in one chunk, and 'hij' and 'k' in another.
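The greedy merge step can be sketched in plain Python. This is a simplified stand-in for _merge_splits, not LangChain's actual implementation:

```python
def merge_splits(splits, separator, chunk_size):
    # Greedily pack consecutive splits into one chunk while the combined
    # length (including joining separators) stays within chunk_size.
    chunks, current, current_len = [], [], 0
    for s in splits:
        sep_len = len(separator) if current else 0
        if current and current_len + sep_len + len(s) > chunk_size:
            chunks.append(separator.join(current))
            current, current_len = [], 0
            sep_len = 0
        current.append(s)
        current_len += sep_len + len(s)
    if current:
        chunks.append(separator.join(current))
    return chunks

print(merge_splits("a\nbcefg\nhij\nk".split("\n"), "\n", 10))
# → ['a\nbcefg', 'hij\nk'], the same chunks as in the question
```

With chunk_size=10, 'a' + '\n' + 'bcefg' is only 7 characters, so those splits are merged; adding '\nhij' would push the total to 11, so a new chunk starts there.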

Xiaomin Wu
    I looked at the source code further into the _merge_splits function (https://github.com/hwchase17/langchain/blob/master/langchain/text_splitter.py), and it makes sense now. – GreenEye Jul 07 '23 at 13:54

When working with LLMs, we don't count characters; we count tokens. You can check this with the tokenizer tool at https://platform.openai.com/tokenizer.


You said it outputs chunks of size 7 and 5, for a total of 12 tokens.

You are using fixed-size chunking. From chunking-strategies/:

Fixed-size chunking

This is the most common and straightforward approach to chunking: we simply decide the number of tokens in our chunk and, optionally, whether there should be any overlap between them. In general, we will want to keep some overlap between chunks to make sure that the semantic context doesn’t get lost between chunks. Fixed-sized chunking will be the best path in most common cases. Compared to other forms of chunking, fixed-sized chunking is computationally cheap and simple to use since it doesn’t require the use of any NLP libraries.
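As a rough illustration, fixed-size chunking with overlap can be sketched as a sliding character window. This is a character-level simplification; real token-based chunkers count tokens instead:

```python
def fixed_size_chunks(text, chunk_size, overlap=0):
    # Each window starts chunk_size - overlap characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(fixed_size_chunks("a\nbcefg\nhij\nk", 10))      # → ['a\nbcefg\nhi', 'j\nk']
print(fixed_size_chunks("abcdefghij", 4, overlap=2))  # → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note that pure fixed-size chunking cuts mid-word and mid-line, which is exactly what the recursive splitter's separator logic tries to avoid.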

Yilmaz

The behavior you are observing in the LangChain recursive text splitter is due to the settings you have provided. Let's break down the code and understand the output.

First, you define a RecursiveCharacterTextSplitter object with a chunk_size of 10 and chunk_overlap of 0. The chunk_size parameter determines the maximum size of each chunk, while the chunk_overlap parameter specifies the number of characters that should overlap between consecutive chunks. In your case, the chunks will not overlap.

Next, you define a test string test with a length of 13 characters. The string contains newline characters ("\n") at specific positions.

When you call r_splitter.split_text(test), the text splitter algorithm processes the input text according to the given parameters. Since the chunk_size is set to 10 and there is no overlap between chunks, the algorithm tries to split the text into chunks of size 10.

The splitting process takes into account the separators you have specified. However, in your code, the separators parameter is commented out (# separators=["\n"]). As a result, the algorithm does not treat newline characters as separators.

The algorithm starts from the beginning of the input text and tries to split it into chunks of size 10. It finds a newline character ("\n") at index 1 and determines that it cannot split the text at this position while maintaining the chunk size of 10. Thus, it continues to the next index.

At index 7, the algorithm finds another newline character. Since it cannot split the text at this position while maintaining the chunk size of 10, it stops the current chunk and starts a new chunk with the remaining text.

Therefore, the output you see is ['a\nbcefg', 'hij\nk'], where the first chunk is 'a\nbcefg' (7 characters) and the second chunk is 'hij\nk' (5 characters).
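Those chunk sizes are easy to verify directly:

```python
chunks = ['a\nbcefg', 'hij\nk']
print([len(c) for c in chunks])  # → [7, 5]
```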

If you want to split the text at every newline character, you need to uncomment the separators parameter and provide "\n" as a separator. Here's the updated code:

from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
    separators=["\n"]
)

test = """a\nbcefg\nhij\nk"""
print(len(test))
tmp = r_splitter.split_text(test)
print(tmp)

With this modification, the output will be ['a', 'bcefg', 'hij', 'k'], as you expected. Each newline character will be treated as a separator, resulting in separate chunks for each part of the text.

Dawam Raja