0

Most questions here are related to finding a pattern like string in a specific column and do something with it. But what if I don't know the column?
Link to Q/A for a specific column: Link

I try to compare two dataframes, to make sure that they match, no columns have been added or rows deleted. One of those files is like a template. Where a group stands for a value range.

An example:

template = pd.DataFrame(
        {'Headline': ['Subheading', '', 'Animal', 'Tiger', 'Bird', 'Lion'],
         'Headline2': ['', 'Weight', 2017, 'group1', 'group2', 'group3'],
         'Headline3': ['', '', 2018, 'group1', 'group2', 'group3']
         })

testfile = pd.DataFrame(
        {'Headline': ['Subheading', '', 'Animal', 'Tiger', 'Bird', 'Lion'],
         'Headline2': ['', 'Weight', 2017, 150, 15, 201],
         'Headline3': ['', '', 2018, 152, 12, 198]
         })

     Headline Headline2 Headline3
0  Subheading                    
1                Weight          
2      Animal      2017      2018
3       Tiger    group1    group1
4        Bird    group2    group2
5        Lion    group3    group3

     Headline Headline2 Headline3
0  Subheading                    
1                Weight          
2      Animal      2017      2018
3       Tiger       150       152
4        Bird        15        12
5        Lion       201       198

If I would do this print((template == testfile).all().all()), it would be False.

As a human, I know that row three to five differs, so I want to exclude them from my comparison:

drop_r = [3, 4, 5]
template = template.drop(template.index[drop_r])
testfile = testfile.drop(testfile.index[drop_r])

Then I would get print((template == testfile).all().all()) is True

So how can I get all row numbers into the object drop_r for the condition that a row contains group[n].

I need to find the rows in template where the substring 'group' appears in any of the columns?

Erik Steiner
  • 581
  • 1
  • 5
  • 18
  • Please provide a [Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help/mcve) and see [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Alex Aug 08 '18 at 13:36
  • @Alex: I changed the example a bit, because I thought that it would be clear enough. Now my request should be clearer. – Erik Steiner Aug 08 '18 at 14:12
  • Your question is still a bit confusing. Do you need to find the rows where `testfile != template` or do you need to find the rows in template where the substring `'group'` appears in any of the columns? – ALollz Aug 08 '18 at 14:57
  • @ALollz: Maybe it is because of the heat. Your last suggestion: I need to find the rows in template where the substring 'group' appears in any of the columns? – Erik Steiner Aug 08 '18 at 15:04

1 Answers1

2

To check if 'group' is found anywhere in the DataFrame you can stack it, check if anything contains 'group' and then get the indices.

import numpy as np

ids = np.unique(template.stack()[template.astype('str').stack().str.contains('group')].index.get_level_values(0))
#array([3, 4, 5], dtype=int64)

To see the bad rows:

template.reindex(ids)
#  Headline Headline2 Headline3
#3    Tiger    group1    group1
#4     Bird    group2    group2
#5     Lion    group3    group3

To see the good rows:

template[~template.index.isin(ids)]
#     Headline Headline2 Headline3
#0  Subheading                    
#1                Weight          
#2      Animal      2017      2018
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • wow, that's even better. I found out this solution by myself, but yours is a lot better: Define groups `drop_r = ['group1', 'group2', 'group3']` and show the dataframe without those groups `testfile.loc[~template.isin(drop_r).any(axis=1)]`. My solution would need a list of familiar groups. Thank you! – Erik Steiner Aug 09 '18 at 06:28