I have a dictionary that maps database primary keys to strings:
self.mydict = {
    1: 'a small example',
    2: 'some sentence',
    3: 'a very long string around 30k characters',
}
For key/value pairs where the string is shorter than about 1,000 characters, everything tokenizes as I would expect.
For a few very long strings (around 30,000 characters), the tokenized output ends up split across multiple broken lines in my CSV output.
import csv
from nltk.tokenize import word_tokenize  # imports used by this method; the enclosing class is omitted

def write_data(self):
    headers = []
    for x, y in self.mydict.items():
        # each row is a single cell containing the token list for one string
        headers.append([word_tokenize(y)])
        print(len(y))
    with open(self.outputdata, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for item in headers:
            writer.writerow(item)
When I write my results to the CSV, I get the following:
['a','small','example']
['some','sentence']
['a','very','long',
string','around','30k','characters']"
So the 30,000-character string breaks for some reason and appears to split onto another line. If I truncate the strings to roughly their first 1,000 characters the problem goes away (a sketch of that workaround is below), but I'd prefer to keep the full strings since I'm doing natural language processing. Is this bug caused by the length of the string, or by the way I'm writing my CSV?
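For reference, here is roughly what I mean by the truncation workaround, as a standalone sketch; the sample string, the output filename, and the 1,000-character cutoff are placeholders standing in for my real database data:

import csv
from nltk.tokenize import word_tokenize

# placeholder for one of my ~30,000-character database strings
long_text = 'a very long string around 30k characters ' * 700

# workaround: tokenize only the first ~1,000 characters
tokens = word_tokenize(long_text[:1000])

with open('check.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([tokens])  # same single-cell row as in write_data above

With the slice in place the row is written on a single line for me; with the full strings from my database, the rows break up as shown above.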