Figure out mismatch amongst two column values

Question

I have a data frame df1 that looks like -

user     data                               dep                    
1        ['dep_78','fg7uy8']                78
2        ['the_dep_45','34_dep','re23u']    45
3        ['fhj56','dep_89','hgjl09']        91

I want to focus on the values in the "data" column that contain the string "dep" and check whether the number attached to that string matches the number in the "dep" column. For example, dep_78 in the data column for user 1 matches 78 in the dep column. I want to output the rows with a mismatch, so the result should be -

user     data                      dep
2        ['the_dep_45','34_dep']   45
3        ['dep_89']                91

The problem is to take only the values in the data column that contain the string "dep" and then compare the numbers attached to those strings with the "dep" column.

The numbers attached with all the strings containing "dep" in the column "data", should match with the numbers in the "dep" column. dep_89 in data is a mismatch to 91 in dep column. — ComplexData, Aug 07 '17 at 21:22
It's my fault for looking on a phone, I missed `dep` in the first block. Still, I think your first step is splitting the strings in `data`? Why do you have a dataframe in this format in the first place? — roganjosh, Aug 07 '17 at 21:24
Can you provide some context for your question? What have you tried so far? Why not refactor your dataframe as suggested to you [here](https://stackoverflow.com/questions/45552952/extracting-specific-rows-from-a-data-frame/45553169#45553169)? — RagingRoosevelt, Aug 07 '17 at 21:24
Possible duplicate of [Extracting specific rows from a data frame](https://stackoverflow.com/questions/45552952/extracting-specific-rows-from-a-data-frame) — RagingRoosevelt, Aug 07 '17 at 21:26

score 0 · Answer 1 · answered Aug 07 '17 at 21:47

How about this?

import re

r = re.compile(r'\d+')

idx = df.apply(lambda x: str(x['dep']) in r.search(x['data']).group(0), axis=1)

0     True
1     True
2    False
dtype: bool


df[idx]

   user                             data  dep
0     1              ['dep_78','fg7uy8']   78
1     2  ['the_dep_45','34_dep','re23u']   45

answered Aug 07 '17 at 21:47

gold_cy


TypeError: ('expected string or buffer', u'occurred at index 0') – ComplexData Aug 08 '17 at 00:35
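
The TypeError reported above is what `re.search` raises when it is handed a list instead of a string, which suggests the `data` cells hold real Python lists rather than their string representation. Below is a minimal sketch of a list-aware check using only the standard library. The helper name `dep_numbers` is made up for illustration and does not come from either answer.

```python
import re

NUM = re.compile(r'\d+')

def dep_numbers(items):
    """Return the numbers found in the list entries that contain 'dep'."""
    found = []
    for item in items:
        if 'dep' in item:
            found.extend(int(n) for n in NUM.findall(item))
    return found

row_data = ['fhj56', 'dep_89', 'hgjl09']

# Passing the list itself to the regex reproduces the reported error
try:
    NUM.search(row_data)
except TypeError as exc:
    error = exc

# Scanning the entries one by one works on list-valued cells
numbers = dep_numbers(row_data)
```

Each number returned by `dep_numbers` can then be compared with the value in the `dep` column for that row.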

giser_yugang · Answer 2 · 2017-08-07T21:54:06.730

-1

You can do it like this:

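def select(row):
    # Entry that would count as a match for this row, e.g. 'dep_78'
    keystring = 'dep_' + str(row['dep'])
    result = []
    for one in row['data']:
        # Keep entries that mention 'dep' but are not the expected one
        if one != keystring and 'dep' in one:
            result.append(one)
    return result

# Replace each list with its mismatching 'dep' entries only
df['data'] = df.apply(select, axis=1)
df['datalength'] = df['data'].map(len)
# Keep rows with at least one mismatch; show the first three columns
result = df[df['datalength'] > 0][df.columns[:3]]
print(result)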
   user                  data  dep
1     2  [the_dep_45, 34_dep]   45
2     3              [dep_89]   91
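
A variant of the same idea that leaves `df` untouched and skips the helper `datalength` column. This is only a sketch: it assumes `data` holds real Python lists and that only an exact `dep_<number>` entry counts as a match, as in the code above. The names `mismatched`, `kept` and `out` are illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumes list-valued cells)
df = pd.DataFrame({
    'user': [1, 2, 3],
    'data': [['dep_78', 'fg7uy8'],
             ['the_dep_45', '34_dep', 're23u'],
             ['fhj56', 'dep_89', 'hgjl09']],
    'dep': [78, 45, 91],
})

def mismatched(items, dep):
    """Entries that mention 'dep' but differ from the expected 'dep_<dep>'."""
    key = 'dep_{}'.format(dep)
    return [s for s in items if 'dep' in s and s != key]

# Build the filtered lists row by row without modifying df
kept = pd.Series(
    [mismatched(items, dep) for items, dep in zip(df['data'], df['dep'])],
    index=df.index,
)

# Swap in the filtered lists and keep only rows that still have entries
out = df.assign(data=kept)
out = out[out['data'].map(len) > 0]
```

The list comprehension still runs at Python speed, so this is tidier than the original rather than faster.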

edited Aug 07 '17 at 21:54

answered Aug 07 '17 at 21:31

giser_yugang


`[]` is less than ideal here. Surely the solution is to fix the initial DF? I don't get why everything is shoved in one column in the first place – roganjosh Aug 07 '17 at 21:35
@roganjosh You can filter them directly in length. – giser_yugang Aug 07 '17 at 21:41
Ok, but why even bother with pandas with this approach? It runs in python time, so you may as well just use a `for` loop – roganjosh Aug 07 '17 at 21:44
expected output is not what OP asked for. – gold_cy Aug 07 '17 at 21:48
@aws_apprentice I think you should read the question more carefully: your answer returns the first two rows, but the expected output is the last two rows. – giser_yugang Aug 07 '17 at 22:00
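
For reference, the plain `for` loop mentioned in the comments could look like the sketch below. It assumes the rows are available as ordinary `(user, data, dep)` tuples and applies the same exact-match rule as the second answer. The names `rows` and `mismatches` are illustrative.

```python
# Sample rows taken from the question, as (user, data, dep) tuples
rows = [
    (1, ['dep_78', 'fg7uy8'], 78),
    (2, ['the_dep_45', '34_dep', 're23u'], 45),
    (3, ['fhj56', 'dep_89', 'hgjl09'], 91),
]

mismatches = []
for user, data, dep in rows:
    expected = 'dep_' + str(dep)
    # Entries that mention 'dep' but are not the expected one
    bad = [item for item in data if 'dep' in item and item != expected]
    if bad:
        mismatches.append((user, bad, dep))

for row in mismatches:
    print(row)
```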
