
I have a large corpus of text data that I'm pre-processing in OpenRefine for document classification with MALLET.

Some of the cells are long (>150,000 characters), and I'm trying to split them into segments of fewer than 1,000 words/tokens.

I'm able to split long cells into 6,000-character chunks with "Split multi-valued cells" by field length, which roughly translates to 1,000-word/token chunks, but it splits words across rows, so I'm losing some of my data.

Is there a function I could use to split long cells at the first whitespace (" ") after every 6,000th character, or, even better, every 1,000 words?

DFM

2 Answers


Here is my simple solution:

Go to Edit cells -> Transform and enter

value.replace(/((\s+\S+?){999})\s+/,"$1@@@")

This replaces every 1,000th whitespace with @@@ (consecutive whitespace characters are counted as one, and the whole run is replaced if it falls on the split boundary). You can choose any marker you like, as long as it doesn't appear in the original text.

Then go to Edit cells -> Split multi-valued cells and split using the token @@@ as the separator.
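If you want to sanity-check the pattern outside OpenRefine first, here is a minimal Python sketch of the same idea (the sample string is hypothetical, and the repetition count is reduced from {999} to {2} so that each chunk holds three words instead of 1,000):

import re

# Same pattern as the GREL transform above, with the count reduced to 2:
# capture two (whitespace + word) pairs, then consume the following run of
# whitespace and re-emit the capture followed by the @@@ marker.
sample = "one two three four five six seven eight"
marked = re.sub(r'((\s+\S+?){2})\s+', r'\1@@@', sample)

print(marked)               # one two three@@@four five six@@@seven eight
print(marked.split('@@@'))  # ['one two three', 'four five six', 'seven eight']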

Frog23
  • Good idea. I wonder if we could modify this regex to split only after a period, which would keep complete sentences in each chunk: `value.replace(/((\s+\S+?){999,})\S+(\.)/,"$0@@@")` – Ettore Rizza Apr 05 '18 at 10:33
  • Thanks. The Python/Jython supplied below makes it clearer how this works, but this is easier to implement. Splitting after a period is also a smart idea. How could I modify the function to split after a period, question mark, or exclamation mark? – DFM Apr 05 '18 at 14:22
  • Just use this one: `value.replace(/((\s+\S+?){999,})\S+[\.\?!]/,"$0@@@")`. It now splits into segments that are at least 1,000 words long and then continues until the next period, question mark, or exclamation mark. – Frog23 Apr 05 '18 at 17:53
  • Mmh, looks like there is something wrong with the last regex. Probably an infinite loop somewhere. – Ettore Rizza Apr 05 '18 at 20:13
  • That's strange, because it works for me. Do you get an error message, or does nothing happen at all? Is your test corpus large enough for the 1,000 words? For testing I always reduce the 999 to a 9; maybe you could try that. – Frog23 Apr 05 '18 at 21:09
  • Sorry, I restarted OpenRefine and now it works. Maybe a Java bug. – Ettore Rizza Apr 05 '18 at 21:30
  • Update: it's worked on some, but not all, of my datasets. Here's the output when the transformation fails/gets stuck: `at java.util.regex.Pattern$GroupTail.match(Unknown Source) at java.util.regex.Pattern$Curly.match1(Unknown Source) at java.util.regex.Pattern$Curly.match(Unknown Source) at java.util.regex.Pattern$Curly.match0(Unknown Source) at java.util.regex.Pattern$Curly.match1(Unknown Source) at java.util.regex.Pattern$GroupHead.match(Unknown Source) at java.util.regex.Pattern$Loop.match(Unknown Source)` – DFM Apr 07 '18 at 14:52
  • That seems to be an [old Java bug](https://bugs.java.com/bugdatabase/view_bug.do?bug_id=5050507). Unfortunately I cannot really test it here without the original data. It could be that the matched segments are too long, in which case you might want to add additional characters to split the string on (e.g. commas), or allow for smaller segments (use 299 instead of 999). Alternatively you could just go with the Python solution from @ettore-rizza. – Frog23 Apr 10 '18 at 09:25
  • Thanks. In the files where I encountered errors I ended up going with your initial answer above, splitting after 1k whitespaces, rather than the first period after 1k whitespaces. Given the size of the corpora I'm working with, I'm not worried that splitting a few sentences across documents will meaningfully affect MALLET's LDA. – DFM Apr 27 '18 at 16:09

The simplest way is probably to split your text on spaces, insert a very rare character (or group of characters) after each group of 1,000 elements, re-concatenate, and then use "Split multi-valued cells" with that rare character (or characters) as the separator.

You can do that in GREL, but it is much clearer if you choose "Python/Jython" as the script language.

So: Edit cells -> Transform -> Python/Jython:

# Split the cell into words, then insert a '|||' marker after every
# n words. The step is n + 1 because each inserted marker shifts the
# later indices up by one.
my_list = value.split(' ')

n = 1000
i = n
while i < len(my_list):
    my_list.insert(i, '|||')
    i += n + 1

return " ".join(my_list)

(For an explanation of this script, see here)

Here is a more compact version:

text = value.split(' ')
n = 1000
# Slice the word list into n-word chunks and join the chunks with '|||'.
return "|||".join([' '.join(text[i:i+n]) for i in range(0, len(text), n)])

You can then split using ||| as separator.
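As a quick check of the slicing logic outside OpenRefine (hypothetical sample text, with n reduced to 3 so the output stays short):

text = "one two three four five six seven eight".split(' ')
n = 3
print("|||".join([' '.join(text[i:i+n]) for i in range(0, len(text), n)]))
# -> one two three|||four five six|||seven eight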


If you prefer to split by characters instead of words, it looks like you can do that in two lines with textwrap:

import textwrap

return "|||".join(textwrap.wrap(value, 6000))
Ettore Rizza