0

I have a df that looks like this:

id.1.value.1        id.2.value.2      id.1.question    id.2.value.2
TRUE                     FALSE             TRUE             TRUE

I want to create logic that scans the column names of the df and extracts the last number only from column names that have value in column name and compare the value in the cell of the column that contains value with following logic:

  1. If value in column value equal TRUE then compare the last number in multi value dictionary

  2. use second value in multi key dictionary to create dataframe column names

Example:

my_dict = {1: ('a', 'category'),2: ('b', 'category'),\
           3: ('c', 'category'),4:('d','category'),\
           5:('e','subcategory'),6:('f','subcategory'),\
           7:('g','subcategory'),8:('h','subcategory'),\
           9:('i','subcategory'),10:('j','subcategory'),\
           11:('k','subcategor'),12:('l','subcategory'),\
           13:('m','subcategory'),14:('n','subcategory'),\
           15:('o','subcategory'),16:('p','subcategory'),\
           17:('q','subcategory'),18:('r','subcategory'),\
           19:('s','subcategory'),20:('t','subcategory'),\
           21:('u','subcategory'),22:('v','subcategory'),\
           23:('w','subcategory'),24:('x','subcategory')

           }

If my current df looks likes this:

id.1.value.1        id.2.value.2      id.1.question    id.6.value.6
    TRUE                 FALSE             TRUE             TRUE

New df should look like this:

category    subcategory
a               f
RustyShackleford
  • 3,462
  • 9
  • 40
  • 81

4 Answers4

1
names = df.columns
new_df = pd.DataFrame()
for name in names:    
    if ('value' in name) & df[name][0]:
        last_number = int(name[-1])
        key, value = my_dict[last_number]
        try:
            new_df[value][0] = list(new_df[value][0]) + [key]
        except:
            new_df[value] = [key]      
cors
  • 527
  • 4
  • 11
  • For some reason `name` is pulling up column that doesnt even contain the word `value`. I have tried adding periods after and before `value` in the `IF` statement. – RustyShackleford Jan 08 '19 at 00:48
  • I confirmed in the `names` the columns with `value` is present – RustyShackleford Jan 08 '19 at 01:01
  • I think the issue is that the `if` statement is not iterating through the entire list and i think `df[name]` is referencing the scientific notation of the number in the column I want like so `0 1.542210e+12 Name: data.answers.1542213646975.itemId, dtype: float64` I used 1,2,3 as examples – RustyShackleford Jan 08 '19 at 01:03
  • I changed `names = df.columns` to `names = df.columns.values`. Got error: `TypeError: unsupported operand type(s) for &: 'bool' and 'numpy.float64'` – RustyShackleford Jan 08 '19 at 01:10
  • Ignore the above error: I changed `names = df.columns` to `names = df.columns.values` and changed `&` to `and`. Which picks up the right column. However `last_number` is only getting the last digit of the entire number. Full column name is: `data.answers.0987654321.value.1234567890` as an example. I want the entire number after `value` to compare in the dict – RustyShackleford Jan 08 '19 at 01:17
  • changed `if ('value' in name) & df[name][0]: last_number = int(name[-1])` to `if ('value' in name) and df[name][0]: last_number = int(name[-13:])` – RustyShackleford Jan 08 '19 at 01:51
0

Where df,

   id.1.value.1  id.2.value.2  id.1.question  id.6.value.6
0          True         False           True          True

Use:

i = df.loc[:,df.columns[df.iloc[0]]].filter(like='value').columns.str.split('.').str[-1].astype(int).tolist()

my_dict = {1: ('a', 'category'),2: ('b', 'category'),\
           3: ('c', 'category'),4:('d','category'),\
           5:('e','subcategory'),6:('f','subcategory'),\
           7:('g','subcategory'),8:('h','subcategory'),\
           9:('i','subcategory'),10:('j','subcategory'),\
           11:('k','subcategor'),12:('l','subcategory'),\
           13:('m','subcategory'),14:('n','subcategory'),\
           15:('o','subcategory'),16:('p','subcategory'),\
           17:('q','subcategory'),18:('r','subcategory'),\
           19:('s','subcategory'),20:('t','subcategory'),\
           21:('u','subcategory'),22:('v','subcategory'),\
           23:('w','subcategory'),24:('x','subcategory')}

df1 = pd.DataFrame.from_dict(my_dict, orient='index')

df_out = df1.loc[i].set_index(1).T

print(df_out)

Output:

1 category subcategory
0        a           f
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • for the `i` variable, i get error: `ValueError: cannot index with vector containing NA / NaN values` I am only looking to extract the number after `value.` – RustyShackleford Jan 07 '19 at 21:58
  • Also how are you comparing the cell value if its `TRUE`? – RustyShackleford Jan 07 '19 at 22:00
  • Oh... I am using df.iloc[0] the first row of your input data frame. df.columns[df.iloc[0]] is the part where I am only selecting columns with the value of true in the dataframe. – Scott Boston Jan 07 '19 at 22:11
  • but wouldnt '0'be the first row of the df? if not 0 what should it be? – RustyShackleford Jan 07 '19 at 22:12
  • I am using iloc which is integer location so integer location numerical 0. – Scott Boston Jan 07 '19 at 22:12
  • If I dont use 0 what should I use? I tried 1 and -1 – RustyShackleford Jan 08 '19 at 01:02
  • .iloc[0] works with any dataframe. Index position start at zero, therefore all dataframes with a single row will have .iloc[0] Now, if your True False values are on a different row, then you can use .loc and identify that row by the index label using .loc instead of by integer location with .iloc. Your dataframe index could be 'A' 'B' 'C' and 'D'. df.iloc[0] will still get the first row. – Scott Boston Jan 08 '19 at 03:06
0
new_df = pd.DataFrame()

# get column names
for col in (list(df)):

    if "value" in col:

        try:
            # operate only in columns where a valid number is found
            value = df[col].rpartition('.')[:-1]

            # When df== True
            if df.loc[col,1]==True:
                new_df[my_dict[value][1]]= my_dict[value][0]
        except Exception as e:
            print(e)
Daniel Scott
  • 979
  • 7
  • 16
0

IIUC:

ans = [my_dict[int(x[-1])] for x in df1.where(df1.loc[:,['value' in x for x in df1.columns]]).dropna(axis=1)]
pd.DataFrame.from_dict({v: k for k, v in dict(ans).items()}, orient='index').T

Output:

  category subcategory
0        a           f
Yuca
  • 6,010
  • 3
  • 22
  • 42
  • when I run the `ans` line I get error `ValueError: Boolean array expected for the condition, not object`. I have confirmed the column with value is listed in the column headers for the df – RustyShackleford Jan 08 '19 at 00:50
  • well it runs on my workbook, so god knows what's the issue with your implementation – Yuca Jan 08 '19 at 01:56