0

I'm trying to combine values of multiple columns into a single column. Suppose I have a csv with the following data

col1,col2,col3,col4
1,2,3,4
6,2,4,6
2,5,6,2

I want it to become a single column with the values concatenated separated by a blank space

col1
1 2 3 4
6 2 4 6
2 5 6 2

The number of columns is 2000+ so having the columns statically concatenated will not do.

Andrie
  • 175
  • 1
  • 12
  • Does this answer your question? [Merge multiple column values into one column in python pandas](https://stackoverflow.com/questions/33098383/merge-multiple-column-values-into-one-column-in-python-pandas) – Divyesh patel Nov 19 '19 at 04:01

3 Answers3

1

I have no idea why you would want such a design. But you can aggregate across axis=1

df.astype(str).agg(' '.join, 1).to_frame('col')

       col
0  1 2 3 4
1  6 2 4 6
2  2 5 6 2
rafaelc
  • 57,686
  • 15
  • 58
  • 82
  • I have records with 2k+ columns which varies I want to store my records in a database. Before querying to a database id like to concatenate the columns into a single string – Andrie Nov 19 '19 at 04:23
0

I would try using pandas. This will find all of the column names and then concat the values for each row across all columns and save it as a new dataframe.

import pandas as pd
df = pd.read_csv('test.csv')
cols = df.columns
df = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

The output for this csv file

c1,c2,c3
1,2,3
4,5,6
7,8,9

Is

Index(['c1', 'c2', 'c3'], dtype='object')
0    1 2 3
1    4 5 6
2    7 8 9

Where the ['c1', 'c2', 'c3'] is all the column names concatenated.

mpearce25
  • 144
  • 1
  • 2
  • Makes no sense to do `df[df.columns]`- it's redundant, just use `df`. Also, it's better to just make the whole `df.astype(str)` at once rather than do that partially for every series inside the `apply`. Finally, no need to do transform the pandas series into a Numpy array by using `row.values` - just use `row` right away. – rafaelc Nov 19 '19 at 04:50
  • @rafaelc you are right.. I just came across another problem using this solution. It would use col1 as the index. – Andrie Nov 19 '19 at 04:57
0

Setting things up:

import numpy as np
import pandas as pd

#generating random int dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

First case(by hand):

str_df1 = df.iloc[:, 0].apply(str) + " " + df.iloc[:, 1].apply(str) + " " + df.iloc[:, 2].apply(str) + " " + df.iloc[:, 3].apply(str)

Second case(generic):

str2_df = df.iloc[:, 0].apply(str)
for i in range(1, df.shape[1]):
  str2_df += " " + df.iloc[:, i].apply(str)

the actual code here

the results

Hope I have helped.

  • Even though this might work, it'll be pretty slow. Avoid looping through a `df` and accessing each row using `iloc[:, I]` - that goes against why the library was built in the first place (no loops!) – rafaelc Nov 19 '19 at 04:51