
I have a dataframe that looks like the following:

Company               keywords

A                     SOFTWARE, IOT, PLATFORM, ENERGY, OPEN SOURCE
B                     ENERGY, PUBLIC UTILITIES, HARDWARE, SOFTWARE
C                     ENERGY, SOFTWARE, ELECTROMOBILITY, EMISSIONS
D                     HARDWARE, DATA, API, SOFTWARE, DATA PLATFORM
E                     ENERGY, SOFTWARE, ELECTROMOBILITY, DATA

I would like to create two separate dataframes: (1) rows that contain the keyword 'SOFTWARE' but not the keyword 'HARDWARE', and (2) rows that contain both 'SOFTWARE' and 'HARDWARE'.

The desired output should look like the following:

df_software
Company               keywords

A                     SOFTWARE, IOT, PLATFORM, ENERGY, OPEN SOURCE
C                     ENERGY, SOFTWARE, ELECTROMOBILITY, EMISSIONS
E                     ENERGY, SOFTWARE, ELECTROMOBILITY, DATA

df_software_hardware
Company               keywords

B                     ENERGY, PUBLIC UTILITIES, HARDWARE, SOFTWARE
D                     HARDWARE, DATA, API, SOFTWARE, DATA PLATFORM

I can easily select the rows that contain 'SOFTWARE' with

df_software=df[df['keywords'].str.contains('SOFTWARE')]

but that also returns the rows that contain 'HARDWARE'.

Thanks in advance.

2 Answers


Try:

import numpy as np

# Boolean indices of rows including word SOFTWARE
ind_df_software=df["keywords"].str.contains("SOFTWARE")

# Boolean indices of rows including word HARDWARE
ind_df_hardware=df["keywords"].str.contains("HARDWARE")

df_software=df.loc[np.logical_and(ind_df_software, ~ind_df_hardware)]
df_software_hardware=df.loc[np.logical_and(ind_df_software, ind_df_hardware)]

Outputs:

>>> df_software

  Company                                      keywords
0       A  SOFTWARE, IOT, PLATFORM, ENERGY, OPEN SOURCE
2       C  ENERGY, SOFTWARE, ELECTROMOBILITY, EMISSIONS
4       E       ENERGY, SOFTWARE, ELECTROMOBILITY, DATA

>>> df_software_hardware

  Company                                      keywords
1       B  ENERGY, PUBLIC UTILITIES, HARDWARE, SOFTWARE
3       D  HARDWARE, DATA, API, SOFTWARE, DATA PLATFORM
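
For reference, here is a self-contained sketch of the same idea. It rebuilds the sample frame from the question and combines the masks with pandas' own `&` and `~` operators instead of `np.logical_and`. The names `has_software` and `has_hardware` are illustrative, and `regex=False` assumes the keywords should be matched as literal text rather than as regular expressions.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "keywords": [
        "SOFTWARE, IOT, PLATFORM, ENERGY, OPEN SOURCE",
        "ENERGY, PUBLIC UTILITIES, HARDWARE, SOFTWARE",
        "ENERGY, SOFTWARE, ELECTROMOBILITY, EMISSIONS",
        "HARDWARE, DATA, API, SOFTWARE, DATA PLATFORM",
        "ENERGY, SOFTWARE, ELECTROMOBILITY, DATA",
    ],
})

# Boolean masks: True where the keyword string contains the literal text.
has_software = df["keywords"].str.contains("SOFTWARE", regex=False)
has_hardware = df["keywords"].str.contains("HARDWARE", regex=False)

# SOFTWARE without HARDWARE, and SOFTWARE together with HARDWARE.
df_software = df[has_software & ~has_hardware]
df_software_hardware = df[has_software & has_hardware]

print(df_software["Company"].tolist())           # ['A', 'C', 'E']
print(df_software_hardware["Company"].tolist())  # ['B', 'D']
```

Computing each mask once and reusing it keeps the two selections consistent with each other.
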
Grzegorz Skibinski

Try this:

df_software=df[~df['keywords'].str.contains('HARDWARE') & df['keywords'].str.contains('SOFTWARE')]
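
One caveat: `str.contains` matches substrings, so a keyword that merely includes the text 'SOFTWARE' would also be selected. If exact keywords matter, a possible sketch is to split each cell into a set of keywords first. This assumes the keywords are always separated by `", "`. The third row below is hypothetical and is there only to show the difference.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["A", "B", "F"],
    "keywords": [
        "SOFTWARE, IOT",
        "HARDWARE, SOFTWARE",
        "SOFTWARE-DEFINED RADIO, ENERGY",  # hypothetical row, not in the question
    ],
})

# Turn each cell into a set of exact keywords (assumes ", " as the separator).
tokens = df["keywords"].str.split(", ").apply(set)

# Membership tests on whole keywords rather than substrings.
has_software = tokens.apply(lambda kw: "SOFTWARE" in kw)
has_hardware = tokens.apply(lambda kw: "HARDWARE" in kw)

df_software = df[has_software & ~has_hardware]
df_software_hardware = df[has_software & has_hardware]

print(df_software["Company"].tolist())           # ['A']
print(df_software_hardware["Company"].tolist())  # ['B']
```

With plain `str.contains('SOFTWARE')`, company F would have been included in `df_software` as well.
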
gtomer