1

I'm a python/pandas newbie and have the following problem: I have a list called 'cat' containing different 13 strings that represent categories. I further have a dataframe called 'ku_drop' that contains 10 columns with HTML code (string format), from which I want to extract information. Now, I want to search for each string of my 'cat'-list in the dataframe and save each cell containing the specific string in the same column. (E.g. all cells containing the string 'Arbeitsatmosphäre' should be saved in Column X1, all containing 'Kommunikation' in Column X2 etc.) How can I do this? I tried with the following, but I only receive an empty dataframe ...

cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Kollegenzusammenhalt', 'Work-Life-Balance', 'Vorgesetztenverhalten', 'Interessante Aufgaben', 'Gleichberechtigung', 'Umgang mit älteren Kollegen', 'Arbeitsbedingungen', 'Umwelt-/Sozialbewusstsein', 'Gehalt/Sozialleistungen', 'Image', 'Karriere/Weiterbildung']
cat_length = len(cat)
df_appender = []
for i in range(cat_length):
    x = "{}".format(category[i] for category in cat)
    df_cat = ku_drop[ku_drop.apply(lambda col: col.str.contains(x, case=False), axis=1)].stack().to_frame()
    df_cat.columns = ['X[i]']
    df_cat = df_cat.dropna(axis=0)
    df_appender.append(df_cat)
df_appender

I'm aware that my code might have a lot of flaws, please excuse this as I am really not very familiar with pandas so far.

ForceBru
  • 43,482
  • 10
  • 63
  • 98
Freyana
  • 61
  • 8
  • Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] to provide a sample of your input and your expected output so that we can better understand your task – G. Anderson Mar 23 '22 at 15:39

1 Answers1

1

Try:

cat = ['Arbeitsatmosphäre', 'Kommunikation', 'Kollegenzusammenhalt', 'Work-Life-Balance', 'Vorgesetztenverhalten', 'Interessante Aufgaben', 'Gleichberechtigung', 'Umgang mit älteren Kollegen', 'Arbeitsbedingungen', 'Umwelt-/Sozialbewusstsein', 'Gehalt/Sozialleistungen', 'Image', 'Karriere/Weiterbildung']
ku_drop = pd.DataFrame({'c1': ['Arbeitsatmosphäre abc', 'abc', 'Work-Life-Balance abc', 'Arbeitsatmosphäre abc'], 'c2': ['abc', 'abc Vorgesetztenverhalten abc', 'Kommunikation abc abc', 'abc abc Arbeitsatmosphäre']})

df = pd.DataFrame(index= range(len(ku_drop)), columns = cat)
for i, c in enumerate(cat):
    used = 0
    for j, c2 in enumerate(ku_drop.columns):
        temp = ku_drop[ku_drop[c2].str.contains(c)][c2].values
        if len(temp)>0:
            df.loc[used:used+len(temp)-1,c] = temp
            used += len(temp)

Output:

enter image description here

keramat
  • 4,328
  • 6
  • 25
  • 38
  • Thank you, I only saw this now unfortunately! I can make it work for your example, but not for my 10000 observation dataframe... can you explain to me what the last part (if len(temp)>0 onwards) means? I think this would help me a lot! @keramat – Freyana Mar 31 '22 at 06:19
  • Can you provide the error or problem encountered the code on your data? if there is any match we assign the rows to the end of added rows. The used variable holds the number of used rows. – keramat Mar 31 '22 at 06:48
  • I don't get an error, I just get a dataframe with only NaNs back... I tried to understand why and replaced ...str.contains(c)... with ...str.contains('Arbeitsatmosphäre')... to check whether the extraction works. The problem seems to be, that the respective strings are not saved in column "Arbeitsatmosphäre", but in every column. – Freyana Mar 31 '22 at 07:05
  • Can you provide a sample of your data? – keramat Mar 31 '22 at 07:09
  • sure, how do I do this? – Freyana Mar 31 '22 at 07:13
  • Just upload it somewhere and put the link in the question. – keramat Mar 31 '22 at 07:15
  • Okay, here is the link (I uploaded both the csv-file and the py-file with the code necessary beforehand) : https://1drv.ms/u/s!Arkw7Y7i28Jikyvvs4Z9Ec5rdccR?e=mjO4R8 – Freyana Mar 31 '22 at 07:25
  • 1
    strange, now it works for me too... thanks so much for your time!! – Freyana Mar 31 '22 at 08:00
  • As much as I check it works. Just add df[df.notna().sum(axis=1)>0] to the end of code to remove empty rows. – keramat Mar 31 '22 at 08:00
  • Also, you can use df.dropna(how='all'). – keramat Mar 31 '22 at 08:03