3

I am trying to get rid of all \n from a whole pandas dataframe. I know there already are answers on this on stack overflow but for some reasons I cant get the desired output. I have the following dataframe:

  title     text    date    authors
0   [ECB completes foreign reserves investment in ...   [\nThe European Central Bank (ECB) completed an ...     [13 June 2017]  ECB
1   [Measures to improve the efficiency of the ope...   [\nThe Governing Council of the ECB has decided ...     [\n 23 January 2003 \n ]    ECB
2   []  []  []  ECB
3   [ECB publishes the results of the Euro Money M...   [Today the European Central Bank (ECB) is publ...   [\n 28 September 2012 \n ]  ECB
4   []  []  []  ECB

This is my desired output:

title   text    date    authors
0   [ECB completes foreign reserves investment in...    [The European Central Bank (ECB) completed an ...   [13 June 2017]  ECB
1   [Measures to improve the efficiency of the ope...   [The Governing Council of the ECB has decided ...   [23 January 2003]   ECB
2   []  []  []  ECB
3   [ECB publishes the results of the Euro Money M...   [Today the European Central Bank (ECB) is publ...   [28 September 2012]     ECB
4   []  []  []  ECB 

These are all codes I tried:

  1. based on this stack overflow post I tried:

    mydf=df.replace({r'\\n': ''}, regex=True)
    
    mydf=df['date'].str.strip(r'\\n') #this turns every obs into NaN 
    
    mydf=df.replace(to_replace=[r"\\n", "\n"], value=["",""], regex=True, inplace =True) #this gets rid of all data in dataframe for some reason
    

neither of which has worked

  1. based on this post I tried (note I am skipping answers which were already tried previously):

    mydf=df.replace(r'\s', '', regex = True, inplace = True) #this deleted all data

  2. based on this post I tried:

    mydf=df.replace('\\n',' ')

  3. based on comments for this post I tried:

    mydf=df['date'].replace(r'\s+|\\n', ' ', regex=True, inplace=True) and

    mydf=df.replace(r'\s+|\\n', ' ', regex=True, inplace=True)

  4. based on answers in this post I tried:

    mydf= df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n', ' ', regex=True)

    mydf=df.replace({ r'\A\s+|\s+\Z': '', '\n' : ' '}, regex=True, inplace=True) # this again deleted whole df

I dont understand why the answers found there are not working in my case since they were accepted and most of those questions seem to be very similar to mine.

petezurich
  • 9,280
  • 9
  • 43
  • 57
1muflon1
  • 209
  • 1
  • 7
  • can you provide a sample df that we can copy into our browsers? – Umar.H Jan 25 '20 at 21:41
  • @Datanovice Sure, should I export the df and upload it somewhere or is there code in python that can give me output that I can copy paste here? – 1muflon1 Jan 25 '20 at 21:48

2 Answers2

2

Try:

df['date']=df['date'].str[0].str.replace(r"\n", "")

That is under the assumption, that each cell in date column is a list with only 1 element. It will also flatten it - so you will get string from that single element.

IF however date can contain more than one element, and you want to merge them all into single string after you get rid of all \n - try

df['date']=df['date'].str.join('').str.replace(r"\n", "")

Otherwise, if you wish to keep it in the list format, just stripping all elements of \n try (&& being interim separator):

df['date']=df['date'].str.join(r'&&').str.replace(r"\n", "").str.split(r'&&')
Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34
  • 1
    Thanks this worked! By the way I am curios about .str[0], usually when you examine lists list[0] gives you the first entry in the list, does str[0] in data frame refers to the whole column? I am asking because I am bit puzzled about it from the first look I would think that would correct only the first date not all of them – 1muflon1 Jan 26 '20 at 10:59
  • 2
    Glad to hear that :) ```.str[0]``` is the way to get the first item of the list, but as a vectorized function i.e. if you have ```pandas.Series``` with each row having an iterable in it, it will return first element for each row's iterable. – Grzegorz Skibinski Jan 26 '20 at 11:06
  • 1
    oh got it, I though you always need some kind of for loop for that, good to know. Thanks for all the help! – 1muflon1 Jan 26 '20 at 11:20
1
d = {'col1': [['\n a b c'], ['\n x y z']], 'col2': [[1.5000], ['\n x y z']]}
df20 = pd.DataFrame(data=d)

print(df20)

def remove_spec_char(string_list=list):
    y = []
    for string_x in string_list:
        if type(string_x) == str:
            y.append(string_x.replace('\n', ''))
        else:
            y.append(string_x)
    return y



for c in df20.columns:

    df20[c] = df20[c].apply(remove_spec_char)

print(df20)
  • I tried this: `def remove_spec_char(string_x=str): return string_x.replace('\n', '') for c in df.columns: df[c] = df[c].apply(remove_spec_char) print(df)` but I get the following error: AttributeError: 'list' object has no attribute 'replace' – 1muflon1 Jan 25 '20 at 21:54
  • please check the above code, edited to process lists – Bhosale Shrikant Jan 25 '20 at 22:05
  • I tried this but now it thinks the object is a float. I get an error for the `df[c] = df[c].apply(remove_spec_char)` TypeError: 'float' object is not iterable. I dont get whats happening because all columns in dataframe contain texts so they should be strings – 1muflon1 Jan 25 '20 at 22:27
  • @BhosaleShirkant the code runs without error but it does not produce desired output. The Grzegorz Skibinski answer worked for me so I dont need help anymore but I give you +1 because I appreciate all the effort you put into your answer – 1muflon1 Jan 26 '20 at 10:54