I am trying to use RecursiveCharacterTextSplitter with a third-party tokenizer as the length function.
According to the documentation, lengthFunction may also return a Promise,
but I am hitting TypeError: Cannot convert undefined to a BigInt
when I run the code below. The only output I get before the error is from console.log(tok === undefined);.
The code works if I pass a length function that returns a Promise-wrapped integer after a dummy delay.
import fs from "fs";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { BartTokenizer } from "@xenova/transformers";

// interface TextSplitterParams {
//   chunkSize: number;
//   chunkOverlap: number;
//   keepSeparator: boolean;
//   lengthFunction?: ((text: string) => number) | ((text: string) => Promise<number>);
// }

async function tokenizer_len(x) {
  return BartTokenizer.from_pretrained("facebook/bart-large").then(
    (tok) => tok(x)["input_ids"].size
  );
}

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 50,
  lengthFunction: tokenizer_len,
});

async function chunkify(text) {
  let output = await splitter.createDocuments([text]);
  return output.map((key) => key.pageContent);
}

const data1 = fs.readFileSync("text.txt", "utf8").toString();
chunkify(data1).then((op) => {
  console.log(op);
});

Error:
TypeError: Cannot convert undefined to a BigInt
    at BigInt (<anonymous>)
    at Array.map (<anonymous>)
    at Function._call (file:///media/instantinopaul/data/Code/ML/js-summarize/node_modules/@xenova/transformers/src/tokenizers.js:2325:50)
    at closure (file:///media/instantinopaul/data/Code/ML/js-summarize/node_modules/@xenova/transformers/src/utils/core.js:62:28)
    at RecursiveCharacterTextSplitter.tokenizer_len [as lengthFunction] (file:///media/instantinopaul/data/Code/ML/js-summarize/main.mjs:14:18)
    at async RecursiveCharacterTextSplitter._splitText (file:///media/instantinopaul/data/Code/ML/js-summarize/node_modules/langchain/dist/text_splitter.js:236:18)
    at async RecursiveCharacterTextSplitter.createDocuments (file:///media/instantinopaul/data/Code/ML/js-summarize/node_modules/langchain/dist/text_splitter.js:76:33)
    at async chunkify (file:///media/instantinopaul/data/Code/ML/js-summarize/main.mjs:27:16)
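
For comparison, the working case (a Promise-wrapped integer with a dummy delay) can be sketched roughly as follows. The name delayedLength, the 10 ms default delay, and the use of the character count as the returned integer are assumptions made for illustration only, not the exact code that was run.

```javascript
// Hypothetical stand-in for tokenizer_len (name and delay are illustrative).
// It returns a Promise that resolves to an integer after a short artificial
// delay; here the integer is simply the number of characters in the text.
function delayedLength(text, delayMs = 10) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(text.length), delayMs);
  });
}

// Quick check outside the splitter: logs the resolved length.
delayedLength("hello world").then((n) => {
  console.log(n);
});
```

Passing a function of this shape as lengthFunction in place of tokenizer_len is the case that splits without error. That suggests the splitter does accept an async length function, and that the failure comes from the tokenizer call itself, which matches the top frames of the stack trace.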