1

I'm just learning Python and for my first project I am trying to re-format a excel table that I can use on GIS. The Table have many columns with x for each corresponding records. I need to assign (replace the x) with the column names and concatenate all rows separated by commas. I was told that Pandas is a very good library to accomplish this. I did started (see sample code) but I am not sure what to do next. Any help or suggestions will be greatly appreciated. Here is a visual representation of what I am trying to accomplish:

enter image description here

Sample Code:

import pandas as pd

input_excel = r"C:\Projects\... Habitat_table.xlsx" # excel sheet path
excel = pd.read_excel(input_excel, sheet_name = 'Species_habitat') # sheet name
final_dataframe = pd.DataFrame (excel, columns=[‘Habitat_A, ‘Habitat B,C,&D’, ‘Habitat_E']) # every single column name
                               
habitats = [‘Habitat_A, ‘Habitat B,C,&D’, ‘Habitat_E']
                                                                  

for index, row in final_dataframe.iterrows():
    final_string = " "
            print (final_dataframe.columns.name)
for h in habitats:
            print(h)
    for c in index:
        if h in index.name: #checks if habitat is in column name
           print(h)
    if row[c] is not null:
            final_string == final_string + c.name + ", "
           print(final_string)
ThePyGuy
  • 17,779
  • 5
  • 18
  • 45
  • What is the expected output? Can you also provide the dataset? – mozway Aug 10 '21 at 15:41
  • The expected output is a table like the second table (orange heading) show in the attached image. Basically, replacing the x with column names and concatenate the columns into a single column separate by commas. (see attached image) – user16633576 Aug 10 '21 at 15:56
  • @user16633576 Please provide data as text, we can not copy the content from image. For expected output, image is acceptable. Please take a look at [How to ask](https://stackoverflow.com/help/how-to-ask) and [How to make good pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – ThePyGuy Aug 10 '21 at 15:57
  • 1
    Similar to [Reversing 'one-hot' encoding in Pandas](https://stackoverflow.com/q/38334296/1288) – Bill the Lizard Aug 10 '21 at 16:37
  • Sorry...I'm not sure how to upload a csv table to the current active question??? – user16633576 Aug 10 '21 at 17:20

2 Answers2

2
data_dict = {
    'Species_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Habitat_A': ['X', '', 'X', 'X', '', 'X', 'X', '', 'X', ''],
    'Habitat B,C,&D': ['X', '', 'X', 'X', '', 'X', 'X', '', 'X', ''],
    'Habitat_E': ['', 'X', '', 'X', 'X', '', 'X', 'X', '', 'X'],
}

df = pd.DataFrame.from_dict(data_dict)
df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: pd.Series([y[0] if y[1] == 'X' else '' for y in x.iteritems()]), axis=1)
df['All_habitats(CONCAT)'] = df.apply(lambda x: ','.join(filter(None, x[1:])), axis=1)
print(df)

Prints:

   Species_ID  Habitat_A  Habitat B,C,&D  Habitat_E                All_habitats(CONCAT)
0           1  Habitat_A  Habitat B,C,&D                       Habitat_A,Habitat B,C,&D
1           2                             Habitat_E                           Habitat_E
2           3  Habitat_A  Habitat B,C,&D                       Habitat_A,Habitat B,C,&D
3           4  Habitat_A  Habitat B,C,&D  Habitat_E  Habitat_A,Habitat B,C,&D,Habitat_E
4           5                             Habitat_E                           Habitat_E
5           6  Habitat_A  Habitat B,C,&D                       Habitat_A,Habitat B,C,&D
6           7  Habitat_A  Habitat B,C,&D  Habitat_E  Habitat_A,Habitat B,C,&D,Habitat_E
7           8                             Habitat_E                           Habitat_E
8           9  Habitat_A  Habitat B,C,&D                       Habitat_A,Habitat B,C,&D
9          10                             Habitat_E                           Habitat_E

Test on 2095 rows * 19 columns from .csv (dummy data)

import pandas as pd, time

tic = time.perf_counter()
df = pd.read_csv(r'c:\Users\Alex20\Documents\Habitats.csv')

df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: pd.Series([y[0] if y[1] == 'X' else '' for y in x.iteritems()]), axis=1)
df['All_habitats(CONCAT)'] = df.apply(lambda x: ','.join(filter(None, x[1:])), axis=1)
print(df)

print(f"Processed in {time.perf_counter() - tic:0.4f} seconds")

Output:

      Species_ID  ...                               All_habitats(CONCAT)
0              1  ...  HabitatA,HabitatB,HabitatC,HabitatD,HabitatF,H...
1              2  ...                         HabitatC,HabitatG,HabitatP
2              3  ...  HabitatA,HabitatB,HabitatC,HabitatE,HabitatG,H...
3              4  ...  HabitatA,HabitatB,HabitatE,HabitatJ,HabitatL,H...
4              5  ...  HabitatD,HabitatI,HabitatK,HabitatL,HabitatM,H...
...          ...  ...                                                ...
2090        2091  ...  HabitatA,HabitatB,HabitatE,HabitatF,HabitatG,H...
2091        2092  ...  HabitatA,HabitatB,HabitatC,HabitatE,HabitatF,H...
2092        2093  ...  HabitatB,HabitatC,HabitatD,HabitatG,HabitatH,H...
2093        2094  ...  HabitatC,HabitatF,HabitatG,HabitatI,HabitatK,H...
2094        2095  ...  HabitatB,HabitatE,HabitatG,HabitatI,HabitatK,H...

[2095 rows x 19 columns]
Processed in 0.4257 seconds

.csv
enter image description here

Алексей Р
  • 7,507
  • 2
  • 7
  • 18
  • Thank you for your suggestion. However , this method would not work for me. The actual excel table has 18 columns for the habitats and 2,096 rows for the species ID. – user16633576 Aug 10 '21 at 16:41
  • @user16633576 1) why? 2) see the test in the solution, .csv 2095 rows * 19 columns processed in 0.4257 seconds – Алексей Р Aug 10 '21 at 19:00
  • Very nice!...Thank you...I wish I understand the code better...I have no idea how to use this code with the excel table! I'm just a beginner with python! – user16633576 Aug 10 '21 at 19:26
0

Here is a solution...

import pandas as pd
import numpy as np

data = {'species_id':[1,2,3],
            'Habitat_A':['x',np.nan,'x'],
            'Habitat_B':['x', np.nan,np.nan],
        'Habitat_C':['x', 'x',np.nan],
    }
    
df = pd.DataFrame(data)
display(df)
for col in df.columns:
    df[col] = df[col].replace('x', col+ ' ')
    df[col] = df[col].replace(np.nan, '')
df['All_Habitats']= df['Habitat_A'].astype(str) + df['Habitat_B'].astype(str)  + df['Habitat_C'].astype(str)
display(df)

Output:

    species_id  Habitat_A   Habitat_B   Habitat_C   All_Habitats
0   1           Habitat_A   Habitat_B   Habitat_C   Habitat_A Habitat_B Habitat_C
1   2                                   Habitat_C   Habitat_C
2   3           Habitat_A                           Habitat_A
  • Thank you for your suggestion. However , this method would not work for me. The actual excel table has 18 columns for the habitats and 2,096 rows for the species ID. – user16633576 Aug 10 '21 at 16:42
  • The code to replace the 'x' value would still work for any number of columns. You could then save to excel using ```df.to_excel('filename.xlsx')```to use the ```=CONCAT()``` Function in Excel to get the ```All_Habitats``` column. This isn't that much work to do in Excel, just select each column one at a time and do a search for X then replace. – Caleb Sprague Aug 10 '21 at 17:09
  • YES..you are right!. I can do this in excel very easy. I just want to practice python since I will be doing this many times and some tables have more than 70 columns and 1000's of records. I am not sure how to use your code without entering the all the excel records ( 2, 096 ) manually to create the data series at the top??? – user16633576 Aug 10 '21 at 17:18
  • ```import pandas as pd ``` then ```df = pd.read_excel('path/to/your/excelfile.xlsx')``` and you will have a dataframe of the excel file. Look at the docs here https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html – Caleb Sprague Aug 10 '21 at 17:29
  • That is what I did (I think) on my attempt ( see my sample code). I imported the ecxel table as dataframe. As I mention some tables have 70 or more columns. Does that mean that I have to manually enter all the columns names at the line---> df ['All_Habitats ] = df [... ] ? – user16633576 Aug 10 '21 at 17:37