
I've converted a column from a CSV to a list, and then to a string for tokenization. After the conversion to a string, '\n' appears throughout. I'm looking either to prevent that from happening in the first place or to remove it afterwards.

So far, I've tried replace, strip, and rstrip to no avail.

Here's a version where I tried .replace() after converting the list to a string.

import pandas as pd
import nltk

df = pd.read_csv('raw_da_qs.csv')
question = df['question_only']
question = question.str.replace(r'\d+', '', regex=True)
question = str(question.tolist())
question = question.replace('\n', '')
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(question)

and I end up with tokens like 'nthere' and 'nsuicide'.
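The stray n comes from how the string is built: `str(question.tolist())` produces the list's repr, in which a real newline character is escaped into two characters, a backslash and an n. So `.replace('\n', '')`, which targets a single real newline, never matches anything. A minimal sketch of the mismatch:

```python
cells = ['\nthere', '\nsuicide']   # real newline characters, as read from the CSV
as_string = str(cells)             # the list's repr: "['\\nthere', '\\nsuicide']"

# No real newline made it into the string, so this is a no-op:
as_string.replace('\n', '')

# The repr contains a literal backslash followed by 'n', so target that instead:
cleaned = as_string.replace('\\n', '')   # "['there', 'suicide']"
```

The cleaner fix is to strip the newlines while the data is still a Series, e.g. `question.str.replace('\n', '', regex=False)`, before ever calling `str()` on the list.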

Maggie

2 Answers


I had the same problem, and the only solution I found was to use sed. I hope someone will share a pythonic way to deal with it.
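For reference, the sed approach can look like the one-liner below (a sketch only: `raw_da_qs.csv` is the asker's file name, the one-liner requires GNU sed, and because it slurps the whole file and deletes every newline it will also erase the row-ending newlines, so it only suits files where that is acceptable):

```shell
# Slurp the entire file into the pattern space, then delete every newline.
sed ':a;N;$!ba;s/\n//g' raw_da_qs.csv > cleaned.csv
```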

Zigfrid
# created a small dummy df for this
import pandas as pd
import nltk

df = pd.DataFrame(['\n good mrng', '\n how are you', '\nwell do\nne'],
                  columns=['question_only'])

# remove the newlines cell by cell, while the data is still a Series
df['replace_n'] = df['question_only'].apply(lambda x: x.replace('\n', ''))

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
df['token'] = df['replace_n'].apply(lambda x: tokenizer.tokenize(x))

# output
df['token']
0       [good, mrng]
1    [how, are, you]
2       [well, done]
Name: token, dtype: object
qaiser