43

I am trying to check if a certain value is contained in a python column. I'm using df.date.isin(['07311954']), which I do not doubt to be a good tool. The problem is that I have over 350K rows and the output won't show all of them so that I can see if the value is actually contained. Put simply, I just want to know (Y/N) whether or not a specific value is contained in a column. My code follows:

import numpy as np
import pandas as pd
import glob


df = (pd.read_csv('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas214.txt',\
    sep='|', header=None, low_memory=False, names=['1', '2', '3', '4', '5', '6', '7', \
    '8', '9', '10', '11', '12', '13', 'date', '15', '16', '17', '18', '19', '20', \
    '21', '22']))

df.date.isin(['07311954'])
YaOzI
  • 16,128
  • 9
  • 76
  • 72
Collective Action
  • 7,607
  • 15
  • 45
  • 60

3 Answers3

64

You can simply use this:

'07311954' in df.date.values which returns True or False


Here is the further explanation:

In pandas, using in check directly with DataFrame and Series (e.g. val in df or val in series ) will check whether the val is contained in the Index.

BUT you can still use in check for their values too (instead of Index)! Just using val in df.col_name.values or val in series.values. In this way, you are actually checking the val with a Numpy array.

And .isin(vals) is the other way around, it checks whether the DataFrame/Series values are in the vals. Here vals must be set or list-like. So this is not the natural way to go for the question.

YaOzI
  • 16,128
  • 9
  • 76
  • 72
35

I think you need str.contains, if you need rows where values of column date contains string 07311954:

print df[df['date'].astype(str).str.contains('07311954')]

Or if type of date column is string:

print df[df['date'].str.contains('07311954')]

If you want check last 4 digits for string 1954 in column date:

print df[df['date'].astype(str).str[-4:].str.contains('1954')]

Sample:

print df['date']
0    8152007
1    9262007
2    7311954
3    2252011
4    2012011
5    2012011
6    2222011
7    2282011
Name: date, dtype: int64

print df['date'].astype(str).str[-4:].str.contains('1954')
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
Name: date, dtype: bool

print df[df['date'].astype(str).str[-4:].str.contains('1954')]
     cmte_id trans_typ entity_typ state  employer  occupation     date  \
2  C00119040       24K        CCM    MD       NaN         NaN  7311954   

   amount     fec_id    cand_id  
2    1000  C00140715  H2MD05155  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I think you, I ran your suggested code and I still get a very long and incomplete list. – Collective Action Mar 12 '16 at 11:11
  • I just ran it, I was having some syntax errors earlier so stopped for a break. When I run the code above it points out the 1954 date; but when I run the code on the same data set after having after having implemented (http://stackoverflow.com/questions/35939558/using-timegrouper-1m-to-group-and-sum-by-columns-is-messing-up-my-date-index-p/35957110#35957110) it returns 'Empty DataFrame Columns: [cmte_id, trans_typ, entity_typ, state, date, amount, fec_id, cand_id] Index: []` – Collective Action Mar 12 '16 at 13:43
  • Jezrael, I'm going to look through the data files again and see just how many files have the date column market with a date that is "out of range". Theoretically, all of my dates should be between 2007 and 2014. Is there a way to use the code that you have posted above but print all values with the last 4 digits between 2007 and 2014? I would also like to delete all rows that occur as such, but first I need to locate them so I can inform the data source of the error in the data. If you can help me with that it would be great! – Collective Action Mar 12 '16 at 14:12
  • Please check [answer](http://stackoverflow.com/a/35958541/2901002), there was problem with parse column `date` to `datetime`. – jezrael Mar 12 '16 at 14:16
  • Are you able to put a list in there so you can check multiple values at once? – Gary Dorman Jun 21 '19 at 20:09
  • @GaryDorman - yes, then use [this](https://stackoverflow.com/a/26577689/2901002) solution – jezrael Jun 22 '19 at 05:36
32

You can use any:

print any(df.column == 07311954)
True       #true if it contains the number, false otherwise

If you rather want to see how many times '07311954' occurs in a column you can use:

df.column[df.column == 07311954].count()
Deusdeorum
  • 1,426
  • 2
  • 14
  • 23