
I am trying to use RecursiveCharacterTextSplitter with a third-party tokenizer as the length function.

According to the documentation, RecursiveCharacterTextSplitter also accepts a lengthFunction that returns a promise,

but I get TypeError: Cannot convert undefined to a BigInt when I run the code below. The only output I see before the error is from console.log(tok === undefined);. The code works if I pass a length function that returns a promise-wrapped integer after a dummy delay.


import fs from "fs";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { BartTokenizer } from "@xenova/transformers";

// interface TextSplitterParams {
//     chunkSize: number;
//     chunkOverlap: number;
//     keepSeparator: boolean;
//     lengthFunction?: ((text: string) => number) | ((text: string) => Promise<number>);
// }

async function tokenizer_len(x) {
  return BartTokenizer.from_pretrained("facebook/bart-large").then(
    (tok) => tok(x)["input_ids"].size
  );
}

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 50,
  lengthFunction: tokenizer_len,
});

async function chunkify(text) {
  let output = await splitter.createDocuments([text]);
  return output.map((key) => key.pageContent);
}

const data1 = fs.readFileSync("text.txt", "utf8").toString();
chunkify(data1).then((op) => {
  console.log(op);
});
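
For comparison, a length function of the kind described above as working (a promise-wrapped integer with a dummy delay) might look like the sketch below. The name `delayedLen` and the 10 ms default delay are illustrative assumptions and are not taken from the original code.

```javascript
// Hypothetical stand-in for the tokenizer-based length function:
// resolves to the character count of `text` after a short artificial delay.
function delayedLen(text, delayMs = 10) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(text.length), delayMs);
  });
}
```

Passing `lengthFunction: delayedLen` in place of `tokenizer_len` is the kind of substitution that runs without the error.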

Error

TypeError: Cannot convert undefined to a BigInt
    at BigInt (<anonymous>)
    at Array.map (<anonymous>)
    at Function._call (file:///media/instantinopaul/data/Code/ML/js-summarize/node_modules/@xenova/transformers/src/tokenizers.js:2325:50)
    at closure (file:///media/instantinopaul/data/Code/ML/js-summarize/node_modules/@xenova/transformers/src/utils/core.js:62:28)
    at RecursiveCharacterTextSplitter.tokenizer_len [as lengthFunction] (file:///media/instantinopaul/data/Code/ML/js-summarize/main.mjs:14:18)
    at async RecursiveCharacterTextSplitter._splitText (file:///media/instantinopaul/data/Code/ML/js-summarize/node_modules/langchain/dist/text_splitter.js:236:18)
    at async RecursiveCharacterTextSplitter.createDocuments (file:///media/instantinopaul/data/Code/ML/js-summarize/node_modules/langchain/dist/text_splitter.js:76:33)
    at async chunkify (file:///media/instantinopaul/data/Code/ML/js-summarize/main.mjs:27:16)
Sayan Dey
  • Are you sure `data.size` is the correct property? – Barmar Aug 31 '23 at 21:33
  • In general, you don't need to use `.then()` inside async functions. Use `await`. It will make the code much less confusing, and it should be easier to add debugging statements. – Barmar Aug 31 '23 at 21:35
  • @Barmar thanks for the pointer. I reduced the text input and now I hit *op is not a function*, so I changed `return op.then((data) => data["input_ids"]).then((data) => data.size);` to `return op["input_ids"].size;`, but the undefined error persists. And yes, `return op["input_ids"].size;` prints a valid value for one iteration. – Sayan Dey Aug 31 '23 at 22:13
  • Your code doesn't use `op` as a function, so I'm not sure where that error comes from. It expects it to be a promise. – Barmar Aug 31 '23 at 22:14
  • I think RecursiveCharacterTextSplitter is somehow not awaiting the function in createDocuments. I added debug statements as well: for very short text it prints correctly as expected, but otherwise it fails with undefined asynchronously. – Sayan Dey Aug 31 '23 at 22:15
  • Since it allows `lengthFunction` to be a promise, it should await it. – Barmar Aug 31 '23 at 22:15
  • Where can I find the documentation of `BartTokenizer.from_pretrained`? – Barmar Aug 31 '23 at 22:18
  • BartTokenizer is an extended class (https://github.com/xenova/transformers.js/blob/0c2dcc74987502c2b17b03eda4f8dce685c5b4df/src/tokenizers.js#L2566) built over https://github.com/xenova/transformers.js/blob/0c2dcc74987502c2b17b03eda4f8dce685c5b4df/src/tokenizers.js#L3673 – Sayan Dey Aug 31 '23 at 22:22
  • I was hoping for documentation. I don't want to read all the code to figure out which values are promises that need to be awaited. – Barmar Aug 31 '23 at 22:24
  • Looks like the jsdoc comments may be good enough. – Barmar Aug 31 '23 at 22:28
  • I found the documentation of the accepted params: https://js.langchain.com/docs/api/text_splitter/interfaces/RecursiveCharacterTextSplitterParams (the error is coming from RecursiveCharacterTextSplitter) – Sayan Dey Aug 31 '23 at 22:35
  • From what I can tell, `tok(x)` is synchronous, it doesn't return a promise, so you don't need to await it. It just returns a plain object. So use `let op = tok(x);` and then you can get `op.input_ids` and `op.attention_mask`. – Barmar Aug 31 '23 at 22:36
  • `input_ids` is an array of numbers or a `Tensor`. I don't see a `size` property in the `Tensor` class. – Barmar Aug 31 '23 at 22:38
  • I already checked with Object.keys; these are the keys available on the tensor: `[ 'dims', 'type', 'data', 'size' ]`. Also, `let op = tok(x);` doesn't help – Sayan Dey Aug 31 '23 at 22:42
  • The catch is that the whole code runs successfully, with debug prints, when the input is small. – Sayan Dey Aug 31 '23 at 22:44
  • Sorry, I'm not familiar with these libraries, so I can't figure out the proper sequence. – Barmar Aug 31 '23 at 22:46
  • Thanks for helping anyway. Now I at least know tok(x) returns an object, not a promise, so I edited the async function a bit. I checked tok(x) separately and it is an object, as you said. – Sayan Dey Aug 31 '23 at 23:01
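
The suggestions in the comments above (use `await` instead of `.then()`, and treat `tok(x)` as a synchronous call that returns a plain object) can be sketched roughly as follows. This is a minimal sketch, not a confirmed fix for the BigInt error: `makeLengthFunction` and `loadStubTokenizer` are hypothetical names, and the stub tokenizer only mimics the `input_ids.size` shape reported in the thread so that the example runs without downloading a model. Loading the tokenizer once and logging the failing input are additions intended to make debugging easier.

```javascript
// Hypothetical helper: builds a lengthFunction from an async tokenizer loader.
// The loader runs once; its promise is cached and reused on every call.
function makeLengthFunction(loadTokenizer) {
  let tokenizerPromise = null;

  return async function lengthFunction(text) {
    if (tokenizerPromise === null) {
      tokenizerPromise = loadTokenizer();
    }
    const tok = await tokenizerPromise;
    try {
      // tok(text) is treated as synchronous, as suggested in the comments.
      const encoded = tok(text);
      return encoded.input_ids.size;
    } catch (err) {
      // Log the start of the input that made the tokenizer throw, then rethrow.
      console.error("tokenizer failed on:", JSON.stringify(text.slice(0, 80)));
      throw err;
    }
  };
}

// Stand-in tokenizer (assumption): counts whitespace-separated words and
// exposes the count as input_ids.size.
let loads = 0;
async function loadStubTokenizer() {
  loads += 1;
  return (text) => ({
    input_ids: { size: text.split(/\s+/).filter(Boolean).length },
  });
}

const tokenizerLen = makeLengthFunction(loadStubTokenizer);
```

With the real library, the loader would presumably be `() => BartTokenizer.from_pretrained("facebook/bart-large")`, and the resulting function would be passed as `lengthFunction` to the splitter.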
