Removing /N character from a column in Python Dataframe

Question

I have a column with the headlines of articles. The headlines look like this (in Greek): [\n, [Μητσοτάκης: Έχει μεγάλη σημασία οι φωτισ..

How can I remove this character: [\n, ?

I have tried this but nothing happened:

 df['Title'].replace('\n', '', regex=True)

Please clarify whether you want to remove the character `\n` from every entry, or to replace any entries that are solely `\n` with blanks. It would help if you included example input and expected output. – Pranav Hosangadi, May 19 '21 at 17:16

BlivetWidget · Answer 1 · 2021-05-19T16:56:46.213

.replace() does not change the dataframe by default; it returns a new object. Use the inplace parameter.

>>> import pandas
>>> df = pandas.DataFrame([{"x": "a\n"}, {"x": "b\n"}, {"x": "c\n"}])
>>> df['x'].replace('\n', '', regex=True)  # does not change df
0    a
1    b
2    c
Name: x, dtype: object
>>> df  # df is unchanged
     x
0  a\n
1  b\n
2  c\n
>>> df['x'].replace('\n', '', regex=True, inplace=True)
>>> df  # df is changed
   x
0  a
1  b
2  c
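As an alternative to inplace=True, the returned Series can be assigned back to the column. This is only a sketch: the column name and sample titles are made up for illustration. It may also be the safer pattern, since in recent pandas versions with Copy-on-Write enabled, an inplace call on a selected column such as df['x'] may not propagate to df.

```python
import pandas as pd

# Hypothetical sample data, loosely modeled on the question's headlines
df = pd.DataFrame({"Title": ["[\n, [headline one", "headline two\n"]})

# Assign the returned Series back to the column instead of using inplace=True
df["Title"] = df["Title"].replace("\n", "", regex=True)

print(df["Title"].tolist())
```

Do one or the other: either reassign the result, or use inplace=True, but not both in the same statement.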

edited May 19 '21 at 16:56

answered May 19 '21 at 16:45


It looks like OP wants to replace a part of text with other text. They are looking for the vectorized string replace which runs the replace on each element of the column, not `df.replace()` which replaces entire elements – Pranav Hosangadi May 19 '21 at 16:49
I'm sorry, but you misunderstand their problem. The only problem is that they did not use inplace. Their approach works fine if they do. – BlivetWidget May 19 '21 at 16:54
I believe my response still does what they want, but until they provide some examples we can only guess. – BlivetWidget May 19 '21 at 17:40
Ah, you're right. The `regex=True` makes it use `re.sub()` under the hood which replaces substrings. Sorry about that! Have a +1 :) – Pranav Hosangadi May 19 '21 at 17:43
No worries at all. Core Python is pretty logical but often the 3rd party libraries are black magic. Pandas, numpy, matplotlib, sklearn. I find for these libraries I just kind of have to memorize more than understand. I did not even know about this feature of the regex parameter until I saw this question and tested it for myself. The inplace issue though, I recognized at once. A very common Pandas trap. – BlivetWidget May 19 '21 at 17:53
I hadn't worked with the `regex` either! That's why I suggested the vectorized element-wise function given by `df.str.replace()` (which cannot do an inplace replacement, so needs reassignment) – Pranav Hosangadi May 19 '21 at 17:56
I tried this: df['Title'] = df['Title'].replace('\n', '', regex=True, inplace=True) but all the headlines are converted to None – Thanasis Souliotis May 20 '21 at 07:49
You’re mixing solutions, you have to do one or the other. If you use the inplace parameter, do not reassign the data frame. – BlivetWidget May 20 '21 at 11:01
I'd just like to add that the vectorized `df.str.replace()` is noticeably faster (at least 4x for all my attempts) than the regular `df.replace()` with regex. – Pranav Hosangadi May 20 '21 at 14:44
Regex pattern matching is unsurprisingly more computationally intensive than a straight up string operation. My intuitive approach would have been a dataframe .apply() operation with a lambda function using replace (less memorizing of library-specific operations). But to me, that wasn't the question. The question was "What am I doing wrong?" not "What is the computationally ideal solution to this problem?" What they were doing wrong was not using the inplace parameter. – BlivetWidget May 20 '21 at 15:28
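The regex behaviour discussed in the comments above can be sketched as follows. The sample values are made up for illustration: without regex=True, Series.replace() only matches whole values, while regex=True makes it substitute matching substrings.

```python
import pandas as pd

# Hypothetical sample: one value containing a newline, one that is only a newline
s = pd.Series(["a\n", "\n"])

# Whole-value matching: only the element equal to "\n" is replaced
whole = s.replace("\n", "")

# Substring matching: every "\n" inside each element is removed
sub = s.replace("\n", "", regex=True)

print(whole.tolist())
print(sub.tolist())
```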

Pranav Hosangadi · Answer 2 · 2021-05-19T16:46:44.033

You're looking for

df['Title'].str.replace('\n', '')

Also remember that this replacement doesn't happen in-place. To change the original dataframe, you're going to have to do

df['Title'] = df['Title'].str.replace('\n', '')

The .str accessor on a column (a Series) provides vectorized string functions that operate on each value in the column. df['Title'].str.replace('\n', '') runs a string replacement on each element of the column.

df.replace() without regex=True replaces entire values in the column with the given replacement.

For example,

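import pandas as pd

data = [{"x": "hello\n"}, {"x": "yello\n"}, {"x": "jello\n"}]
df = pd.DataFrame(data)

# df: 
#          x
# 0  hello\n
# 1  yello\n
# 2  jello\n

# Substring replacement on each element; returns a new Series
df["x"].str.replace('\n', '')

# returns:
# 0    hello
# 1    yello
# 2    jello

# Whole-value replacement; only the exact value 'yello\n' matches
df["x"].replace('yello\n', 'bello\n')

# returns:
# 0    hello\n
# 1    bello\n
# 2    jello\n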
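If the goal is to drop the whole "[\n, " prefix shown in the question, a literal (non-regex) replacement may be simpler, because "[" is a regex metacharacter. The sketch below rests on assumptions: the sample titles are invented, and it is a guess that the column might contain either a real newline or the two literal characters backslash and n.

```python
import pandas as pd

# Hypothetical titles containing a real newline character
titles = pd.Series(["[\n, [Headline A", "[\n, [Headline B"])

# regex=False treats the pattern as plain text, so "[" needs no escaping
cleaned = titles.str.replace("[\n, ", "", regex=False)

# Hypothetical case: the text holds a literal backslash followed by "n"
literal = pd.Series(["[\\n, [Headline A"])
cleaned_literal = literal.str.replace("\\n", "", regex=False)

print(cleaned.tolist())
print(cleaned_literal.tolist())
```

Checking repr(df['Title'].iloc[0]) on the real data would show which of the two cases applies.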
