I'm using OpenRefine to pre-process a large corpus of text data for document classification with MALLET.
Some of the cells are long (>150,000 characters), and I'm trying to split them into segments of fewer than 1,000 words/tokens.
I can split long cells into 6,000-character chunks using "Split multi-valued cells" by field length, which works out to roughly 1,000 words/tokens per chunk, but this cuts words in half across rows, so I'm losing some of my data.
Is there a function I could use to split long cells by the first whitespace (" ") after every 6,000th character, or even better, split every 1,000 words?
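
To make the two behaviours concrete, here is a rough Python sketch of what each split would do. The function names `split_by_words` and `split_after_whitespace` are hypothetical and not part of OpenRefine; this only illustrates the intended logic under the assumption that words are separated by whitespace.

```python
def split_by_words(text, max_words=1000):
    """Split text into chunks of at most max_words whitespace-separated tokens.

    Note: runs of whitespace (including newlines) are collapsed to single spaces.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def split_after_whitespace(text, min_chars=6000):
    """Cut at the first space found at or after every min_chars-th character.

    Words are never broken, so chunks can run slightly longer than min_chars.
    """
    chunks = []
    start = 0
    while start < len(text):
        # Look for the first space once min_chars characters have passed.
        cut = text.find(" ", start + min_chars)
        if cut == -1:
            # No further space: the remainder is the last chunk.
            chunks.append(text[start:])
            break
        chunks.append(text[start:cut])
        start = cut + 1  # skip the space used as the split point
    return chunks
```

If OpenRefine's Python/Jython expression language is an option, one possible approach (an assumption, not something tested here) would be to join the resulting chunks with a separator that does not occur in the text, then run "Split multi-valued cells" on that separator.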