
I've converted a column from a CSV to a list, and then to a string for tokenization. After the conversion to a string, '\n' appears throughout. I'm looking either to prevent that from happening in the first place or to remove it afterwards.

So far, I've tried replace, strip, and rstrip to no avail.

Here's a version where I tried .replace() after converting the list to a string.

import pandas as pd
import nltk

df = pd.read_csv('raw_da_qs.csv')
question = df['question_only']
question = question.str.replace(r'\d+', '', regex=True)
question = str(question.tolist())
question = question.replace('\n', '')
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(question)

and I end up with tokens like 'nthere' and 'nsuicide'.
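The stray n comes from how the string is built: `str(question.tolist())` produces the list's repr, in which a real newline character is escaped into two characters, a backslash and an n. So `.replace('\n', '')`, which targets a single real newline, never matches anything. A minimal sketch of the mismatch:

```python
cells = ['\nthere', '\nsuicide']   # real newline characters, as read from the CSV
as_string = str(cells)             # the list's repr: "['\\nthere', '\\nsuicide']"

# No real newline made it into the string, so this is a no-op:
as_string.replace('\n', '')

# The repr contains a literal backslash followed by 'n', so target that instead:
cleaned = as_string.replace('\\n', '')   # "['there', 'suicide']"
```

The cleaner fix is to strip the newlines while the data is still a Series, e.g. `question.str.replace('\n', '', regex=False)`, before ever calling `str()` on the list.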

Maggie

2 Answers


I had the same problem, and the only solution I found was to use sed. I hope someone will share a pythonic way to deal with it.
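For reference, the sed approach can look like the one-liner below (a sketch only: `raw_da_qs.csv` is the asker's file name, the one-liner requires GNU sed, and because it slurps the whole file and deletes every newline it will also erase the row-ending newlines, so it only suits files where that is acceptable):

```shell
# Slurp the entire file into the pattern space, then delete every newline.
sed ':a;N;$!ba;s/\n//g' raw_da_qs.csv > cleaned.csv
```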

Zigfrid
# created a small dummy df for this
import pandas as pd
import nltk

df = pd.DataFrame(['\n good mrng', '\n how are you', '\nwell do\nne'],
                  columns=['question_only'])

# remove the newlines cell by cell, while the data is still a Series
df['replace_n'] = df['question_only'].apply(lambda x: x.replace('\n', ''))

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
df['token'] = df['replace_n'].apply(lambda x: tokenizer.tokenize(x))

# output
df['token']
0       [good, mrng]
1    [how, are, you]
2       [well, done]
Name: token, dtype: object
qaiser