The following is the code I currently use to remove \n in the 'text' column:

df = pd.read_csv('file1.csv')

df['text'].replace('\s+', ' ', regex=True, inplace=True) # remove extra whitespace
df['text'].replace('\n',' ', regex=True) # remove \n in text

header = ["text", "word_length", "author"]

df_out = df.to_csv('sn_file1.csv', columns = header, sep=',', encoding='utf-8')

I've also tried these suggestions:

df['text'].replace('\n', '')
df['text'] = df['text'].str.replace('\n', '').str.replace('\s+', ' ').str.strip()

Output: ' What a smartass! \nLike he knows anything about real estate deals too...'

The code to remove extra whitespace works, but it does not remove the \n. Can anyone help me with this? Thanks.

I've also tried the approach suggested in this link, "removing newlines from messy strings in pandas dataframe cells?", but it's still not working.

Solved:

df['text'].replace(r'\s+|\\n', ' ', regex=True, inplace=True) 
Gonçalo Peres
Lily
  • how is `df['text'].replace('\n', '')` working? – anky Sep 10 '18 at 08:56
  • @anky_91 ive tried but it's still the same. But thanks for suggesting – Lily Sep 10 '18 at 09:00
  • `\s` matches newlines as well, so it should work, unless your input string contains an actual backslash, followed by a literal `n` instead of a linebreak. – Tim Pietzcker Sep 10 '18 at 09:01
  • Does `df['text'] = df['text'].str.replace('\n', '').str.replace('\s+', ' ').str.strip()` do what you're after? – Jon Clements Sep 10 '18 at 09:01
  • @TimPietzcker supposedly `\n` in the text that i'd retrieved is a breakline. But how if it has changed to an actual backslash, followed by a literal `n` as u mentioned? How can i work from it? – Lily Sep 10 '18 at 09:12
  • @JonClements nope is not working too but Thanks – Lily Sep 10 '18 at 09:13
  • @Lily can you [edit] your question then with the offered solutions and their results and how they differ from your expectations please? At this moment... "nope is not working" is not helping anyone see an approach that possibly could. Thanks. – Jon Clements Sep 10 '18 at 09:15
  • It sounds as if there is no newline at all, but a ``\`` + `n`. If you use `df['text'].replace(r'\s+|\\n', ' ', regex=True, inplace=True)`, does it disappear? – Wiktor Stribiżew Sep 10 '18 at 09:19
  • @JonClements I suggest closing this as a typo. – Wiktor Stribiżew Sep 10 '18 at 09:27

1 Answer

Assuming one wants to apply the changes to the column 'text', select that column with

df['text']

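Then, to remove the unwanted characters, one might use pandas.Series.replace (the Series counterpart of pandas.DataFrame.replace).

It accepts regular expressions: with regex=True, the strings passed as to_replace are interpreted as regexes and substituted inside each value (instead of being matched directly against whole cell values).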
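To make the effect of regex=True concrete, here is a minimal sketch on a made-up Series (the sample values are hypothetical, not taken from the question's file). It also suggests why a plain `df['text'].replace('\n', '')` leaves the text unchanged: without regex=True, replace only swaps cells whose entire value equals the search string.

```python
import pandas as pd

# Hypothetical sample values, for illustration only.
s = pd.Series(["a  b", "  ", "c"])

# Without regex=True the search string is compared against whole cell
# values, so no cell equals the four characters \s+ and nothing changes.
literal = s.replace(r"\s+", " ")

# With regex=True the pattern is applied as a regex substitution
# inside every string value.
as_regex = s.replace(r"\s+", " ", regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

For substring replacement without regular expressions, `Series.str.replace(..., regex=False)` is the usual alternative.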

Picking up on @Wiktor Stribiżew's suggestion, the following will do the job:

df['text'] = df['text'].replace(r'\s+|\\n', ' ', regex=True) 
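The pattern handles two cases: `\s+` matches runs of real whitespace (including actual line breaks), while `\\n` matches a literal backslash followed by the letter n, which is what the question's output suggests the data contains. The sketch below, on hypothetical sample data, shows one way to check which case applies. It then uses a slightly different pattern, `(?:\s|\\n)+`, which is my own variation rather than part of the answer: it collapses mixed runs into a single space, whereas the pattern above leaves two spaces when a space sits right before a literal `\n`.

```python
import pandas as pd

# Hypothetical sample: the first cell holds a backslash followed by "n"
# (two characters), the second holds a real line break (one character).
df = pd.DataFrame(
    {"text": [" What a smartass! \\nLike he knows", "two\nlines"]}
)

# Diagnose which kind of "\n" each cell contains.
has_literal = df["text"].str.contains(r"\\n", regex=True)
has_newline = df["text"].str.contains("\n", regex=False)

# Collapse any run of whitespace and/or literal "\n" sequences
# into a single space, then trim the ends.
df["text"] = (
    df["text"].replace(r"(?:\s|\\n)+", " ", regex=True).str.strip()
)

print(has_literal.tolist(), has_newline.tolist())
print(df["text"].tolist())
```

If the literal `\n` sequences come from how the CSV was written, it may be cleaner to fix the export step, but that depends on where the file originates.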

This regular expression syntax reference may be of help.

Gonçalo Peres