2

I am in the process of creating a python script that extracts data from a poorly designed output file (which I can't change) from a piece of equipment within our research lab. I would like to include a way to iteratively combine the text in the first column of a dataframe (example below) with each other column in the dataframe.

A simple example of the dataframe:

Filename 1 2 3 4 5
a Sheet(1) Sheet(2) Sheet(3) Sheet(4) ....
b Sheet(1) Sheet(2) -------- -------- ....
c Sheet(1) Sheet(2) Sheet(3) Sheet(4) ....
d Sheet(1) Sheet(2) Sheet(3) -------- ....
e Sheet(1) Sheet(2) Sheet(3) Sheet(4) ....
f Sheet(1) -------- -------- -------- ....

What I am looking to produce:

Filename 1 2 3 4 5
a a_Sheet(1) a_Sheet(2) a_Sheet(3) a_Sheet(4) ....
b b_Sheet(1) b_Sheet(2) -------- -------- ....
c c_Sheet(1) c_Sheet(2) c_Sheet(3) c_Sheet(4) ....
d d_Sheet(1) d_Sheet(2) d_Sheet(3) -------- ....
e e_Sheet(1) e_Sheet(2) e_Sheet(3) e_Sheet(4) ....
f f_Sheet(1) -------- -------- -------- ....
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
lake08
  • 23
  • 7

4 Answers4

4
  • Use .apply to prepend the 'Filename' string to the other columns.
  • Of the current answers, the solution from Mykola Zotko is the fastest solution, tested against a 3 column dataframe with 100k rows.
  • If your dataframe has, undesired strings (e.g. '--------'), then use something like df.replace('--------', pd.NA, inplace=True), before combining the column strings.
    • If the final result must have '--------', then use df.fillna('--------', inplace=True) at the end. This will be better than trying to iteratively deal with them.
import pandas as pd
import numpy as np

# test dataframe
df = pd.DataFrame({'Filename': ['a', 'b', 'c'], 'c1': ['s1'] * 3, 'c2': ['s2', np.nan, 's2']})

# display(df)
  Filename  c1   c2
0        a  s1   s2
1        b  s1  NaN
2        c  s1   s2

# prepend the filename strings to the other columns
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)

# display(df)
  Filename    c1    c2
0        a  a_s1  a_s2
1        b  b_s1   NaN
2        c  c_s1  c_s2

%%timeit test against other answers

# test data with 100k rows
df = pd.concat([pd.DataFrame({'Filename': ['a', 'b', 'c'], 'c1': ['s1'] * 3, 'c2': ['s2'] * 3})] * 33333).reset_index(drop=True)

# Solution from Trenton
%%timeit
df.iloc[:, 1:].apply(lambda x: df.Filename + '_' + x)
[out]:
33.6 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Solution from Mykola
%%timeit
df['Filename'].to_numpy().reshape(-1, 1) + '_' + df.loc[:, 'c1':]
[out]:
29.6 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Solution from Alex
%%timeit
df.loc[:, cols].apply(lambda s: df["Filename"].str.cat(s, sep="_"))
[out]:
45.3 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# iterating the columns in a for-loop
def test(d):
    for cols in d.columns[1:]:
        d[cols]=d['Filename'] + '_' + d[cols]
    return d

%%timeit
test(df)
[out]:
53.8 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
1

For example, if you have the following data frame:

  col1 col2 col3 col4
0    a    x    y    z
1    b    x    y    z
2    c    x    y  NaN

You can use broadcasting:

df.loc[:, 'col2':] = df['col1'].to_numpy().reshape(-1, 1) + '_' + df.loc[:, 'col2':]

Result:

  col1 col2 col3 col4
0    a  a_x  a_y  a_z
1    b  b_x  b_y  b_z
2    c  c_x  c_y  NaN
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
0

Try:

for cols in df.loc[:,'1':]:
    df[cols]=df['Filename']+'_'+df[cols]
Suhas Mucherla
  • 1,383
  • 1
  • 5
  • 17
  • This produces a TypeError: cannot do slice indexing on Index with these indexers [1] of type int – lake08 Jan 18 '21 at 05:18
  • @lake08 is your column names of type `int` or `str`? probably `str` . so you can replace 1 with '1' inside `loc` – Suhas Mucherla Jan 18 '21 at 05:21
  • Thanks for the quick reply, they are str's. I tried replacing 1 with '1' in loc and throws a KeyError: '1'. An example of an actual "filename" from the data set is "Highland Range_Site 5_2021-01-14_08-59-20.xlsx" is this helps. – lake08 Jan 18 '21 at 05:28
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - [From Review](/review/low-quality-posts/28103548) – Bijendra Jan 18 '21 at 07:38
  • @Bijendra: Why do you say that? This appears to be an answer to me. – Jeremy Caney Jan 18 '21 at 08:20
  • @JeremyCaney. my bad, was reviewing in bulk and would have mistakenly marked it. Thanks – Bijendra Jan 18 '21 at 12:12
0

I've represented the -------- as np.NaN. You should be able to label these as NaN when you load the file, see nan_values.

This is the dict for the DataFrame:

d = {
    1: [nan, "Sheet(1)", nan],
    2: [nan, "Sheet(2)", nan],
    3: ["Sheet(3)", nan, "Sheet(3)"],
    4: ["Sheet(4)", nan, nan],
    "Filename": ["a", "b", "c"],
}
df = pd.DatFrame(d)

Then we can:

  • Make a mask of the columns we want to change, everything but Filename
cols = df.columns != "Filename"
# array([ True,  True,  True,  True, False])
df.loc[:, cols] = df.loc[:, cols].apply(lambda s: df["Filename"].str.cat(s, sep="_"))

this function takes each column specified in cols and concatenates it with the Filename column.

Which produces:

            1           2           3           4 Filename
0         NaN         NaN  a_Sheet(3)  a_Sheet(4)        a
1  b_Sheet(1)  b_Sheet(2)         NaN         NaN        b
2         NaN         NaN  c_Sheet(3)         NaN        c
Alex
  • 6,610
  • 3
  • 20
  • 38