Iteratively combine text in first column with existing text in other columns

Question

I am in the process of creating a python script that extracts data from a poorly designed output file (which I can't change) from a piece of equipment within our research lab. I would like to include a way to iteratively combine the text in the first column of a dataframe (example below) with each other column in the dataframe.

A simple example of the dataframe:

Filename	1	2	3	4	5
a	Sheet(1)	Sheet(2)	Sheet(3)	Sheet(4)	....
b	Sheet(1)	Sheet(2)	--------	--------	....
c	Sheet(1)	Sheet(2)	Sheet(3)	Sheet(4)	....
d	Sheet(1)	Sheet(2)	Sheet(3)	--------	....
e	Sheet(1)	Sheet(2)	Sheet(3)	Sheet(4)	....
f	Sheet(1)	--------	--------	--------	....

What I am looking to produce:

Filename	1	2	3	4	5
a	a_Sheet(1)	a_Sheet(2)	a_Sheet(3)	a_Sheet(4)	....
b	b_Sheet(1)	b_Sheet(2)	--------	--------	....
c	c_Sheet(1)	c_Sheet(2)	c_Sheet(3)	c_Sheet(4)	....
d	d_Sheet(1)	d_Sheet(2)	d_Sheet(3)	--------	....
e	e_Sheet(1)	e_Sheet(2)	e_Sheet(3)	e_Sheet(4)	....
f	f_Sheet(1)	--------	--------	--------	....

Trenton McKinney · Answer 1 · 2021-01-18T07:20:01.263

Use .apply to prepend the 'Filename' string to the other columns.
Of the current answers, the solution from Mykola Zotko is the fastest solution, tested against a 3 column dataframe with 100k rows.
If your dataframe has, undesired strings (e.g. '--------'), then use something like df.replace('--------', pd.NA, inplace=True), before combining the column strings.
- If the final result must have '--------', then use df.fillna('--------', inplace=True) at the end. This will be better than trying to iteratively deal with them.

import pandas as pd
import numpy as np

# test dataframe
df = pd.DataFrame({'Filename': ['a', 'b', 'c'], 'c1': ['s1'] * 3, 'c2': ['s2', np.nan, 's2']})

# display(df)
  Filename  c1   c2
0        a  s1   s2
1        b  s1  NaN
2        c  s1   s2

# prepend the filename strings to the other columns
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)

# display(df)
  Filename    c1    c2
0        a  a_s1  a_s2
1        b  b_s1   NaN
2        c  c_s1  c_s2

`%%timeit` test against other answers

# test data with 100k rows
df = pd.concat([pd.DataFrame({'Filename': ['a', 'b', 'c'], 'c1': ['s1'] * 3, 'c2': ['s2'] * 3})] * 33333).reset_index(drop=True)

# Solution from Trenton
%%timeit
df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)
[out]:
33.6 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Solution from Mykola
%%timeit
df['Filename'].to_numpy().reshape(-1, 1) + '_' + df.loc[:, 'c1':]
[out]:
29.6 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Solution from Alex
%%timeit
df.loc[:, cols].apply(lambda s: df["Filename"].str.cat(s, sep="_"))
[out]:
45.3 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# iterating the columns in a for-loop
def test(d):
    for cols in d.columns[1:]:
        d[cols]=d['Filename'] + '_' + d[cols]
    return d

%%timeit
test(df)
[out]:
53.8 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Mykola Zotko · Answer 2 · 2021-01-18T07:23:24.897

1

For example, if you have the following data frame:

  col1 col2 col3 col4
0    a    x    y    z
1    b    x    y    z
2    c    x    y  NaN

You can use broadcasting:

df.loc[:, 'col2':] = df['col1'].to_numpy().reshape(-1, 1) + '_' + df.loc[:, 'col2':]

Result:

  col1 col2 col3 col4
0    a  a_x  a_y  a_z
1    b  b_x  b_y  b_z
2    c  c_x  c_y  NaN

edited Jan 18 '21 at 07:23

answered Jan 18 '21 at 06:37

Mykola Zotko

15,583
3
71
73

Suhas Mucherla · Answer 3 · 2021-01-18T05:21:45.173

0

Try:

for cols in df.loc[:,'1':]:
    df[cols]=df['Filename']+'_'+df[cols]

edited Jan 18 '21 at 05:21

answered Jan 18 '21 at 05:11

Suhas Mucherla

1,383
1
5
17

This produces a TypeError: cannot do slice indexing on Index with these indexers [1] of type int – lake08 Jan 18 '21 at 05:18
@lake08 is your column names of type `int` or `str`? probably `str` . so you can replace 1 with '1' inside `loc` – Suhas Mucherla Jan 18 '21 at 05:21
Thanks for the quick reply, they are str's. I tried replacing 1 with '1' in loc and throws a KeyError: '1'. An example of an actual "filename" from the data set is "Highland Range_Site 5_2021-01-14_08-59-20.xlsx" is this helps. – lake08 Jan 18 '21 at 05:28
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - [From Review](/review/low-quality-posts/28103548) – Bijendra Jan 18 '21 at 07:38
@Bijendra: Why do you say that? This appears to be an answer to me. – Jeremy Caney Jan 18 '21 at 08:20
@JeremyCaney. my bad, was reviewing in bulk and would have mistakenly marked it. Thanks – Bijendra Jan 18 '21 at 12:12

Alex · Answer 4 · 2021-01-18T05:43:03.623

I've represented the -------- as np.NaN. You should be able to label these as NaN when you load the file, see nan_values.

This is the dict for the DataFrame:

d = {
    1: [nan, "Sheet(1)", nan],
    2: [nan, "Sheet(2)", nan],
    3: ["Sheet(3)", nan, "Sheet(3)"],
    4: ["Sheet(4)", nan, nan],
    "Filename": ["a", "b", "c"],
}
df = pd.DatFrame(d)

Then we can:

Make a mask of the columns we want to change, everything but Filename

cols = df.columns != "Filename"
# array([ True,  True,  True,  True, False])

Apply a function, which uses Series.str.cat:

df.loc[:, cols] = df.loc[:, cols].apply(lambda s: df["Filename"].str.cat(s, sep="_"))

this function takes each column specified in cols and concatenates it with the Filename column.

Which produces:

            1           2           3           4 Filename
0         NaN         NaN  a_Sheet(3)  a_Sheet(4)        a
1  b_Sheet(1)  b_Sheet(2)         NaN         NaN        b
2         NaN         NaN  c_Sheet(3)         NaN        c

Iteratively combine text in first column with existing text in other columns

4 Answers4

%%timeit test against other answers

`%%timeit` test against other answers