1

I am preparing dataset for openai model training. For an example, I have data in csv in below format

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})

df
    bar foo new 
0   1   a   apple   
1   2   b   banana  
2   3   c   pear

I want to create a new column called 'prompt' with combine values of bar and foo columns

    bar foo new     prompt  
0   1   a   apple   foo: a, bar: 1
1   2   b   banana  foo: b, bar: 2
2   3   c   pear    foo:c, bar: 3

there is similar example here but it doesn't add the column names inside the combined column

tdy
  • 36,675
  • 19
  • 86
  • 83
g1soori
  • 13
  • 2
  • You want to make an object from the columns? {col1: value1, col2: value2, col3: value3}? – dasfacc Jan 17 '23 at 21:39
  • @dasfacc not exactly a object but string in the same format with curly brackets. `{col1: value1, col2: value2, col3: value3}` – g1soori Jan 17 '23 at 22:21
  • @g1soori Your comment has curly braces but your question text doesn't. Which one did you want? – tdy Jan 17 '23 at 22:39

3 Answers3

3

df.apply is very popular but should be avoided whenever possible.

Instead use vectorized methods. Convert the relevant columns to dict and remove the quotes:

df["prompt"] = df[["foo", "bar"]].to_dict(orient="index")
df["prompt"] = df["prompt"].astype(str).replace(r"'", "", regex=True)

#   foo  bar     new            prompt
# 0   a    1   apple  {foo: a, bar: 1}
# 1   b    2  banana  {foo: b, bar: 2}
# 2   c    3    pear  {foo: c, bar: 3}

Note that your comment included braces but your post did not. If you also want to remove the curly braces, add them to the regex:

df["prompt"] = df["prompt"].astype(str).replace(r"[{'}]", "", regex=True)

#   foo  bar     new          prompt
# 0   a    1   apple  foo: a, bar: 1
# 1   b    2  banana  foo: b, bar: 2
# 2   c    3    pear  foo: c, bar: 3

Details

First convert the relevant columns to_dict oriented by index:

df["prompt"] = df[["foo", "bar"]].to_dict(orient="index")

#   foo  bar     new                  prompt
# 0   a    1   apple  {'foo': 'a', 'bar': 1}
# 1   b    2  banana  {'foo': 'b', 'bar': 2}
# 2   c    3    pear  {'foo': 'c', 'bar': 3}

Then use astype to convert it to str type and replace the dict symbols:

df["prompt"] = df["prompt"].astype(str).replace(r"[{'}]", "", regex=True)

#   foo  bar     new          prompt
# 0   a    1   apple  foo: a, bar: 1
# 1   b    2  banana  foo: b, bar: 2
# 2   c    3    pear  foo: c, bar: 3
tdy
  • 36,675
  • 19
  • 86
  • 83
0
df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['prompt'] = df.apply(lambda x: x.to_json(), axis=1)
df

This would work? You get:

    foo bar new     prompt
0   a   1   apple   {"foo":"a","bar":1,"new":"apple"}
1   b   2   banana  {"foo":"b","bar":2,"new":"banana"}
2   c   3   pear    {"foo":"c","bar":3,"new":"pear"}
Vetsin
  • 2,245
  • 1
  • 20
  • 24
  • 1
    Its short and sweet and works for my example. but I prefer the solution from tdy as it can filter out the required columns instead of combining all the columns. thanks for the response. – g1soori Jan 18 '23 at 00:06
0

Combine multiple cells in a row using apply and lambda

df['prompt']=df.apply(lambda k: 'foo: ' + k['foo'] + ', bar: ' + str(k['bar']), axis=1)

This will work?

  foo  bar     new          prompt
0   a    1   apple  foo: a, bar: 1
1   b    2  banana  foo: b, bar: 2
2   c    3    pear  foo: c, bar: 3
Hasnain
  • 21
  • 5
  • 3
    My actual csv file has 50 columns so I don't want to specify each column in the lambda. thanks for the response – g1soori Jan 18 '23 at 00:08