
I am having a problem concerning newline characters and return characters. Ugh, this is hard for me to explain, but I will try.

I have data that exists in list form. The members of the list contain newline characters, like this:

 example_list = ["I've always loved jumping\n\n"]

In order to tokenize this sentence using NLP through NLTK, I need the sentence to be a string. NLTK will ignore newline characters and other escape characters when it tokenizes, according to some tests I ran and evidence from the NLTK tutorial.

The problem is when I try to convert example_list to a string, I get this output:

 str(example_list)
 '["I\'ve always loved jumping\\n\\n"]'

Notice that all newline characters have now become escaped backslash sequences. Tokenizing this yields a terrible result: NLTK thinks that jumping\n\n is one big word, because it treats the double-backslash newline characters as actual text.

Does anyone know any tricks or good practices to ensure that newline characters never exist in my lists, or that they are disregarded or not "double escaped" when converting to a string?

Lastly, does anyone have any suggestions for learning material on how Python processes newline characters and how these characters interact with different data types? It is all very confusing.

Thanks a ton!

Kevin

2 Answers


You already have strings inside your list. Converting the list to a string is (most probably) not what you want: that conversion is meant for displaying the list, e.g. for debugging.

What you want (I assume) is to extract the strings from the list, which leaves the newline characters as they are. How to do that depends on how your data is organised:
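To make the difference concrete, here is a minimal sketch (plain Python, using the example_list from the question) contrasting element access with str() on the whole list:

```python
example_list = ["I've always loved jumping\n\n"]

# Indexing gives you the string itself, with its real newlines intact:
sentence = example_list[0]
print(sentence.endswith("\n\n"))  # True

# str() of the list gives you its repr: quotes and backslash escapes are
# added, so the newlines become literal backslash-n text.
as_repr = str(example_list)
print("\\n" in as_repr)           # True: literal backslash-n
print("\n" in as_repr)            # False: no real newline left
```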

For word tokenisation to work best, it's a good idea to do sentence tokenisation first. Your example only shows a list with a single element, which happens to contain a single sentence. If your data always has one sentence per element (i.e. you have sentence-split text already), you can simply do:

from nltk.tokenize import word_tokenize

for sentence in example_list:
    tokens = word_tokenize(sentence)
    # Do something with the tokens of this sentence...

However, if the text is not sentence-split yet, you need to do that first. Again, there are two possibilities concerning your data: either the list elements are paragraphs, or they are arbitrary fragments.

In the case of paragraphs, the assumption is that each element contains multiple sentences, but the sentences never span across multiple elements. In this case, the code could look like this:

from nltk.tokenize import sent_tokenize, word_tokenize

for paragraph in example_list:
    for sentence in sent_tokenize(paragraph):
        tokens = word_tokenize(sentence)
        # Do something with the tokens of this sentence...

In the second case, where the list elements are arbitrary fragments and sentences may span multiple elements, you need to join them first: NLTK's tools expect a sentence to be one contiguous string. That looks like this:

from nltk.tokenize import sent_tokenize, word_tokenize

text = ''.join(example_list)
for sentence in sent_tokenize(text):
    tokens = word_tokenize(sentence)
    # Do something with the tokens of this sentence...

I hope this gives you some clues!

lenz

You're solving the wrong problem: it's clear from the output you show that you read in a file that actually contains the square brackets, quotes and backslashes. In other words, those \n's are not newlines; they are literal backslash-n sequences. Here's a (triple-quoted, raw) string that reproduces your problem:

>>> mess = r'''["I've always loved jumping\n\n"]'''
>>> str(mess)
'["I\'ve always loved jumping\\n\\n"]'
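If you are indeed stuck with such a file, one way to recover the data (a sketch, assuming the file's contents form a valid Python literal) is ast.literal_eval, which safely parses the repr text back into a real list of strings:

```python
import ast

# The file contents as a raw string: literal backslash-n, not real newlines.
mess = r'''["I've always loved jumping\n\n"]'''

# Parse the text as a Python literal; the \n escapes become real newlines.
recovered = ast.literal_eval(mess)
print(recovered[0].endswith("\n\n"))  # True
```

Still, it is much better to fix the writing side so the file never contains a repr in the first place, as described below.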

Of course, you didn't put your data in raw strings; you got it by reading a file that you had created yourself like this:

with open("newfile.txt", "w") as datafile:
    datafile.write(str(list_of_strings))      # <-- Not a good idea

There's your mistake. Writing a single string just outputs the string's contents, but str() applied to a list produces its repr(), so you end up with brackets, quotes and backslashes in the file. Write out your list of strings properly like this:

with open("newfile.txt", "w") as datafile:  
    datafile.writelines(list_of_strings)

... which is basically an abbreviation for this:

with open("newfile.txt", "w") as datafile:
    for s in list_of_strings:
        datafile.write(s)

Do it this way, and when you read your file back in it will behave properly without you having to play games.
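A quick round trip illustrates the point (a sketch using a temporary file; the file name and sample strings are just for illustration):

```python
import os
import tempfile

list_of_strings = ["I've always loved jumping\n\n", "And running too.\n"]

path = os.path.join(tempfile.mkdtemp(), "newfile.txt")
with open(path, "w") as datafile:
    datafile.writelines(list_of_strings)  # raw text, no repr, no brackets

with open(path) as datafile:
    text = datafile.read()

# The newlines survived as real newlines, ready for tokenization.
print(text == "".join(list_of_strings))   # True
```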

alexis
  • I don't think this is the case for the OP – he clearly shows that the data is given as a list. The `repr` format comes from the `str(...)` call, which is based on the misconception of having to *convert* the list to a string rather than accessing the string elements… – lenz Mar 28 '17 at 16:04
  • You may well be right... it's not uncommon for questions to mangle the real format of the data they're dealing with as they try to simplify, but having now looked at his self-answer, I suspect your interpretation is the correct one. – alexis Mar 28 '17 at 16:46