How to compare string value in dataframe column and value in cell to create new dataframe based on multi value dictionary?

Question

I have a df that looks like this:

id.1.value.1        id.2.value.2      id.1.question    id.2.value.2
TRUE                     FALSE             TRUE             TRUE

I want to create logic that scans the column names of the df and extracts the last number only from column names that have value in column name and compare the value in the cell of the column that contains value with following logic:

If value in column value equal TRUE then compare the last number in multi value dictionary
use second value in multi key dictionary to create dataframe column names

Example:

my_dict = {1: ('a', 'category'),2: ('b', 'category'),\
           3: ('c', 'category'),4:('d','category'),\
           5:('e','subcategory'),6:('f','subcategory'),\
           7:('g','subcategory'),8:('h','subcategory'),\
           9:('i','subcategory'),10:('j','subcategory'),\
           11:('k','subcategor'),12:('l','subcategory'),\
           13:('m','subcategory'),14:('n','subcategory'),\
           15:('o','subcategory'),16:('p','subcategory'),\
           17:('q','subcategory'),18:('r','subcategory'),\
           19:('s','subcategory'),20:('t','subcategory'),\
           21:('u','subcategory'),22:('v','subcategory'),\
           23:('w','subcategory'),24:('x','subcategory')

           }

If my current df looks likes this:

id.1.value.1        id.2.value.2      id.1.question    id.6.value.6
    TRUE                 FALSE             TRUE             TRUE

New df should look like this:

category    subcategory
a               f

yes the dataframe is created a from a submission via webhook. Every new submission is a new file — RustyShackleford, Jan 07 '19 at 21:37

score 1 · Accepted Answer · answered Jan 07 '19 at 22:01

1

names = df.columns
new_df = pd.DataFrame()
for name in names:    
    if ('value' in name) & df[name][0]:
        last_number = int(name[-1])
        key, value = my_dict[last_number]
        try:
            new_df[value][0] = list(new_df[value][0]) + [key]
        except:
            new_df[value] = [key]

answered Jan 07 '19 at 22:01

cors

527
4
11

For some reason `name` is pulling up column that doesnt even contain the word `value`. I have tried adding periods after and before `value` in the `IF` statement. – RustyShackleford Jan 08 '19 at 00:48
I confirmed in the `names` the columns with `value` is present – RustyShackleford Jan 08 '19 at 01:01
I think the issue is that the `if` statement is not iterating through the entire list and i think `df[name]` is referencing the scientific notation of the number in the column I want like so `0 1.542210e+12 Name: data.answers.1542213646975.itemId, dtype: float64` I used 1,2,3 as examples – RustyShackleford Jan 08 '19 at 01:03
I changed `names = df.columns` to `names = df.columns.values`. Got error: `TypeError: unsupported operand type(s) for &: 'bool' and 'numpy.float64'` – RustyShackleford Jan 08 '19 at 01:10
Ignore the above error: I changed `names = df.columns` to `names = df.columns.values` and changed `&` to `and`. Which picks up the right column. However `last_number` is only getting the last digit of the entire number. Full column name is: `data.answers.0987654321.value.1234567890` as an example. I want the entire number after `value` to compare in the dict – RustyShackleford Jan 08 '19 at 01:17
changed `if ('value' in name) & df[name][0]: last_number = int(name[-1])` to `if ('value' in name) and df[name][0]: last_number = int(name[-13:])` – RustyShackleford Jan 08 '19 at 01:51

score 0 · Answer 2 · answered Jan 07 '19 at 21:53

0

Where df,

   id.1.value.1  id.2.value.2  id.1.question  id.6.value.6
0          True         False           True          True

Use:

i = df.loc[:,df.columns[df.iloc[0]]].filter(like='value').columns.str.split('.').str[-1].astype(int).tolist()

my_dict = {1: ('a', 'category'),2: ('b', 'category'),\
           3: ('c', 'category'),4:('d','category'),\
           5:('e','subcategory'),6:('f','subcategory'),\
           7:('g','subcategory'),8:('h','subcategory'),\
           9:('i','subcategory'),10:('j','subcategory'),\
           11:('k','subcategor'),12:('l','subcategory'),\
           13:('m','subcategory'),14:('n','subcategory'),\
           15:('o','subcategory'),16:('p','subcategory'),\
           17:('q','subcategory'),18:('r','subcategory'),\
           19:('s','subcategory'),20:('t','subcategory'),\
           21:('u','subcategory'),22:('v','subcategory'),\
           23:('w','subcategory'),24:('x','subcategory')}

df1 = pd.DataFrame.from_dict(my_dict, orient='index')

df_out = df1.loc[i].set_index(1).T

print(df_out)

Output:

1 category subcategory
0        a           f

answered Jan 07 '19 at 21:53

Scott Boston

147,308
15
139
187

for the `i` variable, i get error: `ValueError: cannot index with vector containing NA / NaN values` I am only looking to extract the number after `value.` – RustyShackleford Jan 07 '19 at 21:58
Also how are you comparing the cell value if its `TRUE`? – RustyShackleford Jan 07 '19 at 22:00
Oh... I am using df.iloc[0] the first row of your input data frame. df.columns[df.iloc[0]] is the part where I am only selecting columns with the value of true in the dataframe. – Scott Boston Jan 07 '19 at 22:11
but wouldnt '0'be the first row of the df? if not 0 what should it be? – RustyShackleford Jan 07 '19 at 22:12
I am using iloc which is integer location so integer location numerical 0. – Scott Boston Jan 07 '19 at 22:12
If I dont use 0 what should I use? I tried 1 and -1 – RustyShackleford Jan 08 '19 at 01:02
.iloc[0] works with any dataframe. Index position start at zero, therefore all dataframes with a single row will have .iloc[0] Now, if your True False values are on a different row, then you can use .loc and identify that row by the index label using .loc instead of by integer location with .iloc. Your dataframe index could be 'A' 'B' 'C' and 'D'. df.iloc[0] will still get the first row. – Scott Boston Jan 08 '19 at 03:06

Daniel Scott · Answer 3 · 2019-01-08T01:53:27.703

0

new_df = pd.DataFrame()

# get column names
for col in (list(df)):

    if "value" in col:

        try:
            # operate only in columns where a valid number is found
            value = df[col].rpartition('.')[:-1]

            # When df== True
            if df.loc[col,1]==True:
                new_df[my_dict[value][1]]= my_dict[value][0]
        except Exception as e:
            print(e)

edited Jan 08 '19 at 01:53

answered Jan 07 '19 at 22:02

Daniel Scott

979
7
16

For some reason `name` is pulling up column that doesnt even contain the word `value`. Also did you mean `df` instead of `df_value`? – RustyShackleford Jan 08 '19 at 00:49
Okay, I wasn't sure if the df was called 'df' :P @RustyShackleford could you try out my updated code? – Daniel Scott Jan 08 '19 at 01:08
i got no error with your code but it did not work. So not sure what was wrong. Below answer worked – RustyShackleford Jan 08 '19 at 01:52
@RustyShackleford I've updated the code, try now. Tell me what error you get. – Daniel Scott Jan 08 '19 at 01:53

Yuca · Answer 4 · 2019-01-07T22:11:56.820

0

IIUC:

ans = [my_dict[int(x[-1])] for x in df1.where(df1.loc[:,['value' in x for x in df1.columns]]).dropna(axis=1)]
pd.DataFrame.from_dict({v: k for k, v in dict(ans).items()}, orient='index').T

Output:

  category subcategory
0        a           f

edited Jan 07 '19 at 22:11

answered Jan 07 '19 at 22:06

Yuca

6,010
3
22
42

when I run the `ans` line I get error `ValueError: Boolean array expected for the condition, not object`. I have confirmed the column with value is listed in the column headers for the df – RustyShackleford Jan 08 '19 at 00:50
well it runs on my workbook, so god knows what's the issue with your implementation – Yuca Jan 08 '19 at 01:56

How to compare string value in dataframe column and value in cell to create new dataframe based on multi value dictionary?

4 Answers4