Adding Multiple Columns at Specific Locations in CSV file using Pandas

Question

I am trying to place multiple columns (Score1, Score2, Score3 etc) before columns whose name begins with a certain text e.g.: Certainty.

I can insert columns at fixed locations using:

df.insert(17, "Score1", " ")

Adding a column then changes the column sequence, so then I would have to look and see where the next column is located. I can add a list of blank columns to the end of a CSV.

So essentially, my understanding is that I have to get pandas to read the column header. If the header text starts with "Certainty", then place a column called Score1 before it.

I tried using:

df.insert(df.filter(regex='Certainty').columns, "Score", " ")

However, as can be guessed it doesn't work.

From what I understand is that pandas is not efficient at iterative methods? Am I misinformed here?

Writing this also leads me to think that it needs a counter for Score1, 2, 3.

Any suggestions would be appreciated!

Thanks in advance.

Updates------Based on feedback provided

Using the method by @SergeBallesta works.

cur=0
for i, col in enumerate(df.columns):
    if col.startswith('Certainty'):
        df.insert(i+cur, f'Score{cur + 1}', '')
        cur += 1

Using the method by @JacoSolari I needed to make a modification to allow it to find all columns starting with "Certainty". And also needed to add Score1, Score2, Score3 automatically.

Version 1: This only adds Score1 in the correct place and then nothing else

counter=0
certcol = df.columns[df.columns.str.contains('Certainty')]
col_idx = df.columns.get_loc(certcol[0])
col_names = [f'Score{counter + 1}']
[df.insert(col_idx, col_name, ' ') 
for col_name in col_names[::-1]]

Version 2: This adds Score1 in the correct place and then adds the rest after the first "Certainty" column. So it does not proceed to find the next one. Perhaps it needs a for loop somewhere?

cur=0
certcol = df.columns[df.columns.str.contains('Certainty')]
for col in enumerate(certcol):
    col_idx = df.columns.get_loc(certcol[0])
    df.insert(cur+col_idx, f'Score{cur + 1}', '')
    cur += 1

I have posted this, in case anyone stumbles across the same need.

You can also put this line `col_idx = df.columns.get_loc(certcol[0])` out of the loop. You don't need to define that at every iteration (in fact it does not depend on the iterator). Also please don't forget to upvote or accept answers if you found them useful :) — JacoSolari, May 20 '20 at 08:38
@JacoSolari, when I place the `col_idx` out of the loop. It just puts all the iterations of Score e.g.: Score1, Score2, Score3,...... after the first "Certainty" column. So it doesn't do: Score1, Certainty1, Some column, Score2, Certainty2, another column, Score3, Certainty3 rather it does Score1, Score2, Score3, Certainty1, Some column, Certainty2, another column, Certainty3 I will keep persisting :) I have also upvoted and accepted. — astro, May 20 '20 at 15:18

score 0 · Accepted Answer · answered May 19 '20 at 10:22

0

You will have to iterate over the columns. It is not as performant as numpy vectorized accesses but sometimes you have no other choice:

Here I would just do:

cur = 0
for i, col in enumerate(df.columns):
    if col.startswith('Certainty')
        df.insert(i+cur, f'Score{cur + 1}', '')
        cur += 1

answered May 19 '20 at 10:22

Serge Ballesta

143,923
11
122
252

This would be very inefficient if your data frame contains a very large number of columns as it will have to iterate and check the `if` condition for all the columns. Better to find the index of the column of interest and add the new columns there. Check my answer, it only iterates on the new columns while the solution you proposed iterates over all the columns in the original data frame. – JacoSolari May 19 '20 at 10:29
This worked well thanks, only needed to add a colon at the end of the `if` statement. I am also trying to implement @JacoSolari method for larger datasets. – astro May 20 '20 at 04:53

score 0 · Answer 2 · answered May 19 '20 at 10:26

0

You can find the location of your Certainty column like this

col_idx = df.columns.get_loc('Certainty')

Then you can add in a for loop each of your new columns and data (here just empty string as in your example) like this

col_names = ['1', '2', '3']
[df.insert(col_idx, col_name, '') for col_name in col_names[::-1]]

So you don't need to update the column index as long as you add the reversed ([::-1]) list of new columns. Also have a look at this question if you didn't already.

answered May 19 '20 at 10:26

JacoSolari

1,226
14
28

Thanks your suggestion works if all the columns were only called 'Certainty' I have modified it to find any columns that begin with 'Certainty'. I will post my variations soon. – astro May 20 '20 at 05:02

Adding Multiple Columns at Specific Locations in CSV file using Pandas

2 Answers2