52

I am new to Python. R users have a function, paste, that concatenates two or more variables in a data frame, and it's very useful. For example, suppose that I have this data frame:

   categorie titre tarifMin  lieu  long   lat   img dateSortie
1      zoo,  Aquar      0.0 Aquar 2.385 48.89 ilo,0           
2      zoo,  Aquar      4.5 Aquar 2.408 48.83 ilo,0           
6      lieu  Jardi      0.0 Jardi 2.320 48.86 ilo,0           
7      lieu  Bois       0.0 Bois  2.455 48.82 ilo,0           
13     espac Canal      0.0 Canal 2.366 48.87 ilo,0           
14     espac Canal     -1.0 Canal 2.384 48.89 ilo,0           
15     parc  Le Ma     20.0 Le Ma 2.353 48.87 ilo,0 

I want to create a new column that combines other columns of the data frame with some text. In R, I do:

> y$thecolThatIWant=ifelse(y$tarifMin!=-1,
+                             paste("Evenement permanent  -->",y$categorie,
+                                   y$titre,"C  partir de",y$tarifMin,"€uros"),
+                             paste("Evenement permanent  -->",y$categorie,
+                                   y$titre,"sans prix indique"))

And the result is :

> y
   categorie titre tarifMin  lieu  long   lat   img dateSortie
1      zoo,  Aquar      0.0 Aquar 2.385 48.89 ilo,0           
2      zoo,  Aquar      4.5 Aquar 2.408 48.83 ilo,0           
6      lieu  Jardi      0.0 Jardi 2.320 48.86 ilo,0           
7      lieu  Bois       0.0 Bois  2.455 48.82 ilo,0           
13     espac Canal      0.0 Canal 2.366 48.87 ilo,0           
14     espac Canal     -1.0 Canal 2.384 48.89 ilo,0           
15     parc  Le Ma     20.0 Le Ma 2.353 48.87 ilo,0           
                                                thecolThatIWant
1  Evenement permanent  --> zoo,  Aquar C  partir de  0.0 €uros
2  Evenement permanent  --> zoo,  Aquar C  partir de  4.5 €uros
6  Evenement permanent  --> lieu  Jardi C  partir de  0.0 €uros
7  Evenement permanent  --> lieu  Bois  C  partir de  0.0 €uros
13 Evenement permanent  --> espac Canal C  partir de  0.0 €uros
14 Evenement permanent  --> espac Canal C  partir de -1.0 €uros
15 Evenement permanent  --> parc  Le Ma C  partir de 20.0 €uros

My question is : How can I do the same thing in Python Pandas or some other module?

What I've tried so far: I'm a very new user, so sorry for any mistakes. I tried to replicate the example in Python; suppose that I get something like this:

table=pd.read_csv("y.csv",sep=",")
tt= table.loc[:,['categorie','titre','tarifMin','long','lat','lieu']]
table
    categorie   titre   tarifMin    long    lat     lieu
0   zoo,    Aquar   0.0     2.385   48.89   Aquar
1   zoo,    Aquar   4.5     2.408   48.83   Aquar
2   lieu    Jardi   0.0     2.320   48.86   Jardi
3   lieu    Bois    0.0     2.455   48.82   Bois
4   espac   Canal   0.0     2.366   48.87   Canal
5   espac   Canal   -1.0    2.384   48.89   Canal
6   parc    Le Ma   20.0    2.353   48.87   Le Ma

Basically, I tried this:

sc="Even permanent -->" + " "+ tt.titre+" "+tt.lieu
tt['theColThatIWant'] = sc
tt

And I got this

    categorie   titre   tarifMin    long    lat     lieu    theColThatIWant
0   zoo,    Aquar   0.0     2.385   48.89   Aquar   Even permanent --> Aquar Aquar
1   zoo,    Aquar   4.5     2.408   48.83   Aquar   Even permanent --> Aquar Aquar
2   lieu    Jardi   0.0     2.320   48.86   Jardi   Even permanent --> Jardi Jardi
3   lieu    Bois    0.0     2.455   48.82   Bois    Even permanent --> Bois Bois
4   espac   Canal   0.0     2.366   48.87   Canal   Even permanent --> Canal Canal
5   espac   Canal   -1.0    2.384   48.89   Canal   Even permanent --> Canal Canal
6   parc    Le Ma   20.0    2.353   48.87   Le Ma   Even permanent --> Le Ma Le Ma

Now, I suppose that I have to loop with a condition if there is no vectorized equivalent like in R?

NelsonGon
GjT
  • There are many ways to do this, but since most of Python is not "vectorized" they usually involve iterators or some version of list comprehension. Please share what you've tried so far and why it hasn't worked. – Justin Jan 22 '14 at 19:53
  • Here are some existing recipes (paste is not included, though): http://pandas.pydata.org/pandas-docs/dev/comparison_with_r.html – Jeff Jan 22 '14 at 20:30

9 Answers

56

This works very much like the paste command in R. R code:

 words = c("Here", "I","want","to","concatenate","words","using","pipe","delimeter")
 paste(words,collapse="|")

[1] "Here|I|want|to|concatenate|words|using|pipe|delimeter"

Python:

words = ["Here", "I","want","to","concatenate","words","using","pipe","delimeter"]
"|".join(words)

Result:

'Here|I|want|to|concatenate|words|using|pipe|delimeter'
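
Note that str.join only accepts strings and collapses a single list, so two cases need a little more work: lists that contain non-string items, and pasting two lists element by element. The following is a minimal sketch in plain Python; the variable names `states` and `numbers` are only illustrative.

```python
# str.join requires strings, so convert each item first.
i = 1
filename = "".join(str(part) for part in [i, ".jpg"])
print(filename)  # 1.jpg

# Element-wise pasting of two equal-length lists, like paste0(states, numbers) in R.
states = ["TX", "CA", "NY"]
numbers = [1, 2, 3]
pasted = [f"{s}{n}" for s, n in zip(states, numbers)]
print(pasted)  # ['TX1', 'CA2', 'NY3']
```

Unlike R, zip stops at the shortest input instead of recycling it.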

Shankar
SAHIL BHANGE
  • I'm surprised this answer isn't higher up. The join function is such a simple and short implementation. – bart cubrich Nov 06 '19 at 16:53
  • What if you had a number in this list like `i=1`, then `words=[i, ".jpg"]` – mikey Jun 08 '20 at 15:27
  • I think this answer is good when you only need to connect words by one expression (e.g. "_"). However, a little complicated example of other use of paste/paste0 function: paste0(coeff," (",CI_lower, ",", CI_higher,")"), this method won't help it. – Charlotte Deng Jul 28 '20 at 20:47
  • The main difference is that the R function is vectorized. If `states = [TX, CA, NY]` and `numbers = [1, 2, 3]` then the paste function should return ['TX1', 'CA2', 'NY3']. R's problem is simpler because python has more types to worry about: lists, numpy arrays, pandas Series, so it is not clear what the return type should be if `numbers` is a numpy array and `states` is a Series. In this R complies with pep20's "only one way" directive more than python. – Steven Scott Sep 09 '20 at 14:45
20

Here's a simple implementation that works on lists, and probably other iterables. Warning: it's only been lightly tested, and only in Python 3.5+:

from functools import reduce

def _reduce_concat(x, sep=""):
    return reduce(lambda x, y: str(x) + sep + str(y), x)
        
def paste(*lists, sep=" ", collapse=None):
    result = map(lambda x: _reduce_concat(x, sep=sep), zip(*lists))
    if collapse is not None:
        return _reduce_concat(result, sep=collapse)
    return list(result)

assert paste([1,2,3], [11,12,13], sep=',') == ['1,11', '2,12', '3,13']
assert paste([1,2,3], [11,12,13], sep=',', collapse=";") == '1,11;2,12;3,13'

You can also have some more fun and replicate other functions like paste0:

from functools import partial

paste0 = partial(paste, sep="")
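
A quick usage check of paste0, as a sketch. To keep the snippet self-contained, paste is restated here with str.join instead of reduce; that rewrite is an assumption made for brevity and is only meant to match the version above for non-empty inputs.

```python
from functools import partial

def paste(*lists, sep=" ", collapse=None):
    # Restated with str.join; intended to match the reduce-based
    # version for non-empty inputs.
    rows = [sep.join(str(item) for item in row) for row in zip(*lists)]
    return rows if collapse is None else collapse.join(rows)

paste0 = partial(paste, sep="")

print(paste0(["a", "b"], [1, 2]))                # ['a1', 'b2']
print(paste0(["a", "b"], [1, 2], collapse="+"))  # a1+b2
```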

Edit: here's a Repl.it project with type-annotated versions of this code.

shadowtalker
  • Thank you! This worked beautifully for me. And it actually executed faster than a list comprehension approach that I had. – paulstey Apr 04 '16 at 14:34
  • I think you could get a lot more upvotes by making this answer into a comprehensive review of the alternatives in different cases for future readers – Hack-R Jun 24 '18 at 00:52
  • This function is not a great mimic of the R behavior. Here are simple examples of how it can fail: `paste("Hello", ["Ben", "Mike"])` gives `['H Ben', 'e Mike']`; `paste(["Hello"], ["Ben", "Mike"])` gives `['Hello Ben']`; `paste("a", ["Ben", "Mike"])` gives `['a Ben']`. None of these is what we want. – Tal Galili Aug 06 '22 at 07:04
6

For this particular case, the paste function in R is closest to Python's str.format, which was added in Python 2.6. It's newer and somewhat more flexible than the older % operator.

For a pure Python answer without numpy or pandas, here is one way to do it using your original data in the form of a list of lists (this could also have been done as a list of dicts, but that seemed more cluttered to me).

# -*- coding: utf-8 -*-
names=['categorie','titre','tarifMin','lieu','long','lat','img','dateSortie']

records=[[
    'zoo',   'Aquar',     0.0,'Aquar',2.385,48.89,'ilo',0],[
    'zoo',   'Aquar',     4.5,'Aquar',2.408,48.83,'ilo',0],[
    'lieu',  'Jardi',     0.0,'Jardi',2.320,48.86,'ilo',0],[
    'lieu',  'Bois',      0.0,'Bois', 2.455,48.82,'ilo',0],[
    'espac', 'Canal',     0.0,'Canal',2.366,48.87,'ilo',0],[
    'espac', 'Canal',    -1.0,'Canal',2.384,48.89,'ilo',0],[
    'parc',  'Le Ma',    20.0,'Le Ma', 2.353,48.87,'ilo',0] ]

def prix(p):
    if (p != -1):
        return 'C  partir de {} €uros'.format(p)
    return 'sans prix indique'

def msg(a):
    return 'Evenement permanent  --> {}, {} {}'.format(a[0],a[1],prix(a[2]))

[m.append(msg(m)) for m in records]

from pprint import pprint

pprint(records)

The result is this:

[['zoo',
  'Aquar',
  0.0,
  'Aquar',
  2.385,
  48.89,
  'ilo',
  0,
  'Evenement permanent  --> zoo, Aquar C  partir de 0.0 \xe2\x82\xacuros'],
 ['zoo',
  'Aquar',
  4.5,
  'Aquar',
  2.408,
  48.83,
  'ilo',
  0,
  'Evenement permanent  --> zoo, Aquar C  partir de 4.5 \xe2\x82\xacuros'],
 ['lieu',
  'Jardi',
  0.0,
  'Jardi',
  2.32,
  48.86,
  'ilo',
  0,
  'Evenement permanent  --> lieu, Jardi C  partir de 0.0 \xe2\x82\xacuros'],
 ['lieu',
  'Bois',
  0.0,
  'Bois',
  2.455,
  48.82,
  'ilo',
  0,
  'Evenement permanent  --> lieu, Bois C  partir de 0.0 \xe2\x82\xacuros'],
 ['espac',
  'Canal',
  0.0,
  'Canal',
  2.366,
  48.87,
  'ilo',
  0,
  'Evenement permanent  --> espac, Canal C  partir de 0.0 \xe2\x82\xacuros'],
 ['espac',
  'Canal',
  -1.0,
  'Canal',
  2.384,
  48.89,
  'ilo',
  0,
  'Evenement permanent  --> espac, Canal sans prix indique'],
 ['parc',
  'Le Ma',
  20.0,
  'Le Ma',
  2.353,
  48.87,
  'ilo',
  0,
  'Evenement permanent  --> parc, Le Ma C  partir de 20.0 \xe2\x82\xacuros']]

Note that although I've defined a list names it isn't actually used. One could define a dictionary with the names of the titles as the key and the field number (starting from 0) as the value, but I didn't bother with this to try to keep the example simple.

The functions prix and msg are fairly simple. The only tricky portion is the list comprehension [m.append(msg(m)) for m in records] which iterates through all of the records, and modifies each to append your new field, created via a call to msg.
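
The dictionary idea mentioned above can be sketched as follows. This is only an illustration under assumptions: `describe` is a hypothetical helper, just two sample rows are used, and the message text is copied verbatim from the code above. It also uses a plain for loop, which is the more conventional choice when only the side effect is wanted.

```python
names = ['categorie', 'titre', 'tarifMin', 'lieu', 'long', 'lat', 'img', 'dateSortie']

# Two sample rows taken from the data above.
records = [
    ['zoo', 'Aquar', 0.0, 'Aquar', 2.385, 48.89, 'ilo', 0],
    ['espac', 'Canal', -1.0, 'Canal', 2.384, 48.89, 'ilo', 0],
]

def describe(row):
    # Hypothetical helper: look fields up by name instead of by position.
    if row['tarifMin'] != -1:
        prix = 'C  partir de {} €uros'.format(row['tarifMin'])
    else:
        prix = 'sans prix indique'
    return 'Evenement permanent  --> {}, {} {}'.format(
        row['categorie'], row['titre'], prix)

# Turn each record into a dict keyed by column name.
rows = [dict(zip(names, record)) for record in records]
for row in rows:
    row['thecolThatIWant'] = describe(row)

print(rows[1]['thecolThatIWant'])
# Evenement permanent  --> espac, Canal sans prix indique
```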

Edward
  • OK, thanks. It works like that, but I think lowtech's pandas version is better suited to my use. – GjT Jan 22 '14 at 22:29
2

My answer is loosely based on the original question and was edited from the answer by woles. I would like to illustrate these points:

  • paste corresponds to the % operator in Python
  • using apply, you can compute a new value and assign it to a new column

For R folks: there is no direct equivalent of ifelse (but there are ways to replace it nicely).

import numpy as np
import pandas as pd

dates = pd.date_range('20140412',periods=7)
df = pd.DataFrame(np.random.randn(7,4),index=dates,columns=list('ABCD'))
df['categorie'] = ['z', 'z', 'l', 'l', 'e', 'e', 'p']

def apply_to_row(x):
    ret = "this is the value i want: %f" % x['A']
    if x['B'] > 0:
        ret = "no, this one is better: %f" % x['C']
    return ret

df['theColumnIWant'] = df.apply(apply_to_row, axis = 1)
print(df)
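
One possible replacement for ifelse is numpy.where, which selects element by element between two alternatives. The sketch below applies it to three sample rows from the question; the frame `tt` and the intermediate names `prefix`, `with_price` and `without_price` are assumptions for illustration, and the message text is copied verbatim from the question.

```python
import numpy as np
import pandas as pd

# Three sample rows from the question's data.
tt = pd.DataFrame({
    "categorie": ["zoo,", "espac", "parc"],
    "titre": ["Aquar", "Canal", "Le Ma"],
    "tarifMin": [4.5, -1.0, 20.0],
})

# String + Series concatenates element-wise, like paste in R.
prefix = "Evenement permanent  --> " + tt["categorie"] + " " + tt["titre"]
with_price = prefix + " C  partir de " + tt["tarifMin"].astype(str) + " €uros"
without_price = prefix + " sans prix indique"

# np.where(test, yes, no) picks element-wise, like ifelse(test, yes, no) in R.
tt["thecolThatIWant"] = np.where(tt["tarifMin"] != -1, with_price, without_price)
print(tt["thecolThatIWant"].tolist())
```

Numeric columns have to be cast with astype(str) before they can be concatenated with strings.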
lowtech
  • Great, that's exactly what I want. But I have some problems when I try to paste more than two elements. It seems that `ret = "this is the value i want: %s" % x['A'] % "euros"` inside apply_to_row doesn't work. I am looking for something else and will share it if I succeed. – GjT Jan 22 '14 at 22:32
  • @GjT it should be ret = "this is the value i want: %s euros" % x['A'] – lowtech Jan 22 '14 at 23:04
  • numpy.where is equivalent to ifelse in R – xingzhi.sg Aug 23 '14 at 08:18
2
  1. You can try pandas.Series.str.cat

    import pandas as pd
    def paste0(ss,sep=None,na_rep=None,):
        '''Analogy to R paste0'''
        ss = [pd.Series(s) for s in ss]
        ss = [s.astype(str) for s in ss]
        s = ss[0]
        res = s.str.cat(ss[1:],sep=sep,na_rep=na_rep)
        return res
    
    pasteA=paste0
    
  2. Or just sep.join()

    #
    def paste0(ss,sep=None,na_rep=None, 
        castF=str,  # Python 3 str is Unicode (the Python 2 original used unicode)
    ):
        if sep is None:
            sep=''
        res = [castF(sep).join(castF(s) for s in x) for x in zip(*ss)]
        return res
    pasteB = paste0
    
    
    %timeit pasteA([range(1000),range(1000,0,-1)],sep='_')
    # 100 loops, best of 3: 7.11 ms per loop
    %timeit pasteB([range(1000),range(1000,0,-1)],sep='_')
    # 100 loops, best of 3: 2.24 ms per loop
    
  3. I have used itertools to mimic recycling

    import itertools
    def paste0(ss,sep=None,na_rep=None,castF=str):
        '''Analogy to R paste0
        '''
        if sep is None:
            sep=u''
        L = max([len(e) for e in ss])
        # zip is lazy in Python 3 (the Python 2 original used itertools.izip)
        it = zip(*[itertools.cycle(e) for e in ss])
        res = [castF(sep).join(castF(s) for s in next(it) ) for i in range(L)]
        # res = pd.Series(res)
        return res
    
  4. patsy might be relevant (not an experienced user myself.)
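
Returning to the recycling idea in item 3, here is a small Python 3 sketch that also treats scalars as length-1 vectors, roughly as R does. `paste_recycled` is a hypothetical name, not part of any library, and the treatment of scalars is an assumption about the desired behaviour.

```python
from itertools import cycle, islice

def paste_recycled(*args, sep=" "):
    # Hypothetical helper: wrap scalars (including plain strings) in a list,
    # then recycle shorter inputs up to the longest length.
    vectors = [[a] if isinstance(a, (str, int, float)) else list(a) for a in args]
    longest = max(len(v) for v in vectors)
    recycled = [islice(cycle(v), longest) for v in vectors]
    return [sep.join(str(x) for x in row) for row in zip(*recycled)]

print(paste_recycled("Hello", ["Ben", "Mike"]))          # ['Hello Ben', 'Hello Mike']
print(paste_recycled(["a", "b"], [1, 2, 3, 4], sep=""))  # ['a1', 'b2', 'a3', 'b4']
```

Empty inputs are not handled the way R handles them; this sketch simply returns an empty list if any vector is empty.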

NelsonGon
shouldsee
1

Let's try things with apply.

df.apply( lambda x: str( x.loc[ desired_col ] ) + "pasting?" , axis = 1 )

You will get a result similar to paste.
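
A concrete sketch of the apply approach, using a small made-up frame. Here `desired_col` is assumed to be the label of the column to paste, and the suffix text is only an example.

```python
import pandas as pd

df = pd.DataFrame({"titre": ["Aquar", "Bois"], "tarifMin": [0.0, 4.5]})
desired_col = "tarifMin"  # assumed: the label of the column to paste

# axis=1 passes each row to the lambda as a Series.
pasted = df.apply(lambda x: str(x.loc[desired_col]) + " €uros", axis=1)
print(pasted.tolist())  # ['0.0 €uros', '4.5 €uros']
```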

slfan
胡亦朗
1

If you want to just paste two string columns together, you can simplify @shouldsee's answer because you don't need to create the function. E.g., in my case:

df['newcol'] = df['id_part_one'].str.cat(df['id_part_two'], sep='_')

It might be required for both Series to be of dtype object in order for this to work (I haven't verified).
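
As far as I can tell, the .str accessor and str.cat only work on string data, so a numeric column needs to be cast first. A minimal sketch with a made-up frame that reuses the column names above:

```python
import pandas as pd

df = pd.DataFrame({"id_part_one": ["a", "b"], "id_part_two": [1, 2]})

# Cast both columns to strings so that .str.cat accepts them.
left = df["id_part_one"].astype(str)
right = df["id_part_two"].astype(str)
df["newcol"] = left.str.cat(right, sep="_")
print(df["newcol"].tolist())  # ['a_1', 'b_2']
```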

Corey Levinson
0

This is a simple example of how to achieve that (if I understand correctly what you want to do):

import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
for row in df.itertuples():
    index, A, B, C, D = row
    print('%s Evenement permanent  --> %s , next data %s' % (index, A, B))

Output:

>>>df
                   A         B         C         D
2013-01-01 -0.400550 -0.204032 -0.954237  0.019025
2013-01-02  0.509040 -0.611699  1.065862  0.034486
2013-01-03  0.366230  0.805068 -0.144129 -0.912942
2013-01-04  1.381278 -1.783794  0.835435 -0.140371
2013-01-05  1.140866  2.755003 -0.940519 -2.425671
2013-01-06 -0.610569 -0.282952  0.111293 -0.108521

This is what the loop prints:

2013-01-01 00:00:00 Evenement permanent  --> -0.400550121168 , next data -0.204032344442

2013-01-02 00:00:00 Evenement permanent  --> 0.509040318928 , next data -0.611698560541

2013-01-03 00:00:00 Evenement permanent  --> 0.366230438863 , next data 0.805067758304

2013-01-04 00:00:00 Evenement permanent  --> 1.38127775713 , next data -1.78379439485

2013-01-05 00:00:00 Evenement permanent  --> 1.14086631509 , next data 2.75500268167

2013-01-06 00:00:00 Evenement permanent  --> -0.610568516983 , next data -0.282952162792
Michał
  • I don't have an index in my dataframe. I've tried, but it doesn't work. In addition to that, I have a condition on the value of one variable to display a different value. But thanks. – GjT Jan 22 '14 at 21:08
0

There is actually a very easy way. You just convert your variable to a string. For instance, try to run this:

a = 1; b = "you are number " + str(a); b
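
The same result can be written without an explicit str() call by letting a format string do the conversion. A short sketch (f-strings need Python 3.6 or later):

```python
a = 1
b = f"you are number {a}"          # f-string
c = "you are number {}".format(a)  # str.format, works on older versions too
print(b)  # you are number 1
```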
pmadhu
Roxy