
I have a dictionary mapping database primary keys to strings.

self.mydict = {
    1: 'a small example',
    2: 'some sentence',
    3: 'a very long string around 30k characters'
}

For key-value pairs where the string is under about 1,000 characters, everything tokenizes as I would expect.
For a few very large strings (around 30,000 characters), the tokenized output ends up split across multiple broken lines in my CSV output.

def write_data(self):
    headers=[]
    for x,y in self.mydict.items():
        headers.append([word_tokenize(y)])
        print(len(y))

    with open(self.outputdata, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for item in headers:
            writer.writerow(item)

When I write my results to a CSV, I get the following:

['a','small','example']
['some','sentence']
['a','very','long',
string','around','30k','characters']"

So the 30k-character string breaks for some reason and appears to spill onto another line. If I truncate the strings to roughly their first 1,000 characters the problem goes away, but I'd prefer to keep the long strings since I'm doing natural language processing. Is this bug due to the length of the string, or to the way I'm writing my CSV?

barker
  • Are you getting the brackets in your output file? `word_tokenize(y)` is already a list, don't enclose it in brackets. – alexis Jun 05 '19 at 20:48
  • Thanks alexis, no, I'm appending multiple things to create a matrix; I just cut out part of the code, so it looks wacky. – barker Jun 06 '19 at 21:59

1 Answer


No, there is no string length limit on NLTK's word_tokenize() function.
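As a quick sanity check (a sketch, assuming NLTK is installed and the punkt tokenizer data has been downloaded), word_tokenize() handles a string of this size without complaint:

from nltk.tokenize import word_tokenize

# Build a string of roughly 44k characters and tokenize it in one call.
long_text = 'some sentence about natural language processing ' * 900
tokens = word_tokenize(long_text)
print(len(long_text), len(tokens))  # the full string is tokenized, no truncation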

But the csv module has a limit on the field size; see https://docs.python.org/3.4/library/csv.html?highlight=csv#csv.field_size_limit
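If the error appears when the file is read back (typically "field larger than field limit (131072)", as in the question linked in the comment below), the limit can be raised with csv.field_size_limit(). The following is a minimal sketch, not the asker's code; 'output.csv' is a placeholder for whatever self.outputdata points to:

import csv
import sys

# Raise the csv parser's maximum field size (the default is 131072 characters).
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        # sys.maxsize can exceed the underlying C long on some platforms,
        # so back off until a value is accepted.
        limit = limit // 10

# Placeholder path; substitute the real output file.
with open('output.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        print(len(row))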

alvas
  • Ah, problems with the csv, I see. This question helped, for anyone else who finds this post: https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072 – barker Jun 06 '19 at 21:57