-1

I'm trying to turn every skill in the job_skills column into its own attribute (column), encoded as 0 or 1. How can I do that?

Note: I'm building a data frame by hand in the code below, but filling it in manually isn't practical. I'm looking for a way to extract the lists from the column, because I need to apply ML algorithms to this data.

import pandas as pd

# Each row: an identifier and that row's list of skills
data = [['a', ['Python', 'UI',' Information Technology (IT)','Software Development','GTK','English',' Software Engineering']],
        ['b', ['Python', 'Relational Databases',' Celery',' VMWare','Django','Continous Integration',' Test Driven Development',' HTTP']],
        ['c', ['Flask', 'Python',' Celery',' Software Development',' Computer Science','Information Technology (IT)']],
        ['c', ['Flask', 'Python',' Celery',' Software Development',' Computer Science','Information Technology (IT)']]]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])

# One row per skill, one dummy column per skill, then collapse back to one row per original index
pd.get_dummies(df1['col2'].explode()).groupby(level=0).sum()

jmoerdyk
sas
  • I told you everything, but I don't know what you're asking – Panda Kim Nov 06 '22 at 13:14
  • Are you talking about not being able to make a list in `job skills` column? – Panda Kim Nov 06 '22 at 13:27
  • 1
    Please don't vandalize your posts. By posting on the Stack Exchange network, you've granted a non-revocable right, under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license, for Stack Exchange to distribute that content (i.e. regardless of your future choices). By Stack Exchange policy, the non-vandalized version of the post is the one which is distributed, and thus, any vandalism will be reverted. If you want to know more about deleting a post please see: [How does deleting work?](https://stackoverflow.com/help/what-to-do-instead-of-deleting-question). – jmoerdyk Mar 14 '23 at 16:48

3 Answers

1

I can't think of anything in pandas that does this out of the box. If I understand correctly, you want one-hot variables for each skill for each person (row). Do you have a unique identifier for each job? If not, you need one. In the example below I use the row index.

rows = []    # row label in the original df for each (row, skill) pair
skills = []  # the matching skill

for index, row in df.iterrows():
    for item in row['job_skills']:
        rows.append(index)
        skills.append(item)

df = pd.DataFrame({'row': rows, 'skills': skills})
 

Once you have df you can follow the same logic here:

How can I one hot encode in Python?

If you need the encoded columns alongside the original data, join or merge the result back onto the original df afterwards.
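The steps above (long format, one-hot encoding, join back) can be sketched end to end as follows. This is only an illustration under assumptions: the sample data is made up, `job_skills` is assumed to hold real Python lists, and `explode` plus `pd.crosstab` are swapped in for the explicit loop.

```python
import pandas as pd

# Hypothetical sample data; only the column name job_skills comes from the question.
df = pd.DataFrame({
    "job_title": ["dev a", "dev b", "dev c"],
    "job_skills": [["Python", "UI"], ["Python", "Celery"], ["Flask", "Python"]],
})

# Long format: one (row, skill) pair per line, equivalent to the loop above.
long_df = df["job_skills"].explode().rename("skill").reset_index()
long_df = long_df.rename(columns={"index": "row"})

# Count each skill per original row, then cap the counts at 1 to get 0/1 flags.
one_hot = pd.crosstab(long_df["row"], long_df["skill"]).clip(upper=1)

# Join the encoded columns back onto the original frame by index.
result = df.join(one_hot)
print(result)
```

Keeping the original row label in the long table is what makes the final join possible.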

Chris
0

example:

data = [['a', "['Python', 'UI']"],
        ['b', "['Python', 'Celery']"],
        ['c', "['Flask', 'Python']"],
        ['c', "['Flask', 'Python']"]]
df1= pd.DataFrame(data, columns=['col1', 'col2'])
df1

output:

    col1    col2
0   a   ['Python', 'UI']
1   b   ['Python', 'Celery']
2   c   ['Flask', 'Python']
3   c   ['Flask', 'Python']

Here col2 holds strings, not lists.



df1['col2'].apply(lambda x: x[2:-2].split("', '"))

output:

0        [Python, UI]
1    [Python, Celery]
2     [Flask, Python]
3     [Flask, Python]
Name: col2, dtype: object

This converts each string in col2 into a list.

Then you can use the following code:

df1['col2'] = df1['col2'].apply(lambda x: x[2:-2].split("', '"))
pd.get_dummies(df1['col2'].explode()).groupby(level=0).sum()
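As a possible alternative to slicing and splitting the string by hand, the standard library's `ast.literal_eval` can parse a string that looks like a Python list. The sketch below assumes every value in col2 is a well-formed list literal; the sample data (including the stray leading space in `' Celery'`) is made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical sample data in the same shape as the example above.
df1 = pd.DataFrame(
    [["a", "['Python', 'UI']"],
     ["b", "['Python', ' Celery']"],
     ["c", "['Flask', 'Python']"]],
    columns=["col1", "col2"],
)

# Parse each string as a Python literal, then strip stray whitespace from each skill.
parsed = df1["col2"].apply(ast.literal_eval).apply(
    lambda skills: [s.strip() for s in skills]
)

# One dummy column per skill; max() per original row yields a 0/1 flag
# even if a skill is listed twice in the same row.
encoded = pd.get_dummies(parsed.explode()).groupby(level=0).max().astype(int)
print(encoded)
```

Stripping whitespace matters here because `' Celery'` and `'Celery'` would otherwise become two different columns.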
Panda Kim
0

Here is a proposal using standard pandas DataFrame functions:

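def create_dummies(df, col):
    # Add one dummy column per distinct value of df[col]
    dummies = pd.get_dummies(df[col])
    df[dummies.columns] = dummies
    return df

out = (
    df.assign(skill=df["job_skills"].str.strip("[]")
                                    .str.replace("'", "")
                                    .str.split(","))
      .explode("skill")
      .pipe(create_dummies, 'skill')
      .iloc[:, 5:]        # skip the 4 original columns and the helper 'skill' column
      .groupby(level=0)   # collapse back to one row per original index
      .sum()
)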

# Output :

display(out)

[image: output of display(out)]

# Input used:

print(df.to_string())

    job_title    company    location                                                                                             job_skills
0   Python Or    ItsTime   Oakville,          ['Python', 'UI', 'Computer Science', '. Information Technology (IT)', 'Software Development']
1  Senior Pyt   CLOUDSIG  Sofia, Bul         ['Python3', 'Relational Databases', '. Celery', 'VMWare', '. Django',' Continous Integration']
2  Flask Pyth  Cyber sec  Cairo, Egy  ['Flask', 'Python', '. Software Development', '. Computer Science', '. Information Technology (IT)']
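Some skills in the input above carry a leading ". " or a space (for example '. Computer Science' next to 'Computer Science'), so they would end up as separate columns. A minimal sketch of one way to normalise them before encoding is shown below. It assumes that leading or trailing dots and spaces are never meaningful, which would be wrong for a skill such as ".NET", and it uses a shortened, made-up version of the input.

```python
import pandas as pd

# Shortened, hypothetical version of the input shown above.
df = pd.DataFrame({
    "job_title": ["Python Or", "Flask Pyth"],
    "job_skills": [
        "['Python', 'UI', 'Computer Science', '. Information Technology (IT)']",
        "['Flask', 'Python', '. Computer Science', '. Information Technology (IT)']",
    ],
})

skills = (
    df["job_skills"]
    .str.strip("[]")                    # drop the surrounding brackets
    .str.replace("'", "", regex=False)  # drop the quotes
    .str.split(",")
    .explode()
    .str.strip(" .")                    # drop leading/trailing spaces and dots
)

# One 0/1 column per cleaned skill, one row per original index.
out = pd.get_dummies(skills).groupby(level=0).max().astype(int)
print(out)
```

With this cleaning step, 'Computer Science' and '. Computer Science' map to the same column.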
Timeless