
I have a data frame df that contains many field names for a series of years.

                                                   field
year description                                               
1993 bar0                                       a01arb92
     bar1                                       a01svb92
     bar2                                       a01fam92
     bar3                                       a08
     bar4                                       a01bea93

Then, for every year, I have a Stata file that has `id` as a column and, as additional columns, some (or all) of the field names mentioned in `df`. For example, `1993.dta` could be

id a01arb92 a01svb92 a08 a01bea93
0         1        1   1        1
0         1        1   1        2

I need to check, for every year, whether all the fields listed in `df` really exist (as columns) in the corresponding file. I would then like to save the result back in the original data frame. Is there a nice way to do this without iterating over every single field?

Expected Output:

                                                   field   exists
year description                                               
1993 bar0                                       a01arb92        1
     bar1                                       a01svb92        1
     bar2                                       a01fam92        0
     bar3                                       a08             1
     bar4                                       a01bea93        1

This is the expected output if, for example, every field but a01fam92 exists as a column in the 1993 file.


2 Answers


Try to go through every year, filter the data frame to get the fields associated with each specific year, then check whether those fields appear in the Stata file or not.

  • Read your Stata file, and store it in a data frame:

    import pandas as pd
    d = pd.read_stata("file")

  • Read your csv file, and store it in a data frame:

    df = pd.read_csv("file")

  • Filter and extract the fields of every year:

    df[df["year"] == 1993]["field"]  # Output: Series of the fields in year 1993

You can generalize the process by going through the list of years:

l = df["year"].unique()
for x in l:
    f = df[df["year"] == x]["field"]
    # Then check whether the fields in f are columns of the Stata file.

Here you can find a detailed explanation of how to filter fields using pandas.

  • Compare the Stata file's fields with the list you have:

You can use the built-in all() function:

all(item in d.columns for item in f)

If it is True, then all the elements in the field list are columns of the Stata file.
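
For example, a quick sketch with toy data (the column names here are just illustrative):

    import pandas as pd

    d = pd.DataFrame(columns=["a01arb92", "a01svb92", "a08", "a01bea93"])
    f = ["a01arb92", "a01fam92"]
    all(item in d.columns for item in f)  # False: a01fam92 is not a column of d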

Putting everything together:

l = df["year"].unique()  # list of years
IsInDic = {}             # dictionary mapping year -> whether all its fields are in the Stata file, e.g. {1993: True}
for x in l:
    f = df[df["year"] == x]["field"]
    # Check whether all the fields listed for year x are columns of the Stata file.
    isInList = all(item in d.columns for item in f)
    IsInDic[x] = isInList  # store the result so you can decide later whether it's true or not
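
If you then want that per-year result back in the original frame, the dictionary can be mapped onto the year column (a sketch, assuming year is a regular column of df; the column name all_exist is just illustrative):

    # Series.map looks each year up in the dictionary
    df["all_exist"] = df["year"].map(IsInDic)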

UPDATE

def isInList(year):
    fields = set(df[df["year"] == year]["field"])
    return [c for c in d.columns if c in fields] == list(d.columns)
  • Well, that's the way I initially had in mind. But that's iterating over every file, and then, after saving it in a dict, I suppose I'll have to iterate over that to get it into the original data frame. Is there no way to use the fact that both `df` and `d` are data frames? – FooBar Oct 28 '14 at 16:55
  • @FooBar check the UPDATE. What if we use filtering? We create a filtered list to which we append every element of d if it's in the fields, then compare the result to d. If we get the same list, it means that all the elements are in fields; false in the opposite case. –  Oct 29 '14 at 13:30
  • I think your update should read `return [...] == df[df["year"]==x].fields`. But then, I just know whether it contains *all* of the fields or not. To recover the expected output in the question, I would still need to iterate over all the fields, wouldn't I? – FooBar Oct 29 '14 at 14:48
  • True! You should go through the set of years you have to select the fields related to every year (it would help you detect the problem earlier). –  Oct 30 '14 at 05:55
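
For the per-field exists flag in the expected output, a vectorized check per year might look like this (a sketch, assuming df has year and field as plain columns and the files are named like 1993.dta):

    import pandas as pd

    df["exists"] = 0
    for year in df["year"].unique():
        d = pd.read_stata("%d.dta" % year)  # one Stata file per year
        mask = df["year"] == year
        # Series.isin tests membership for all of the year's fields at once
        df.loc[mask, "exists"] = df.loc[mask, "field"].isin(d.columns).astype(int)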

Here is a way to do it that uses the fact that pandas aligns on the index and automatically fills NaN for missing entries.

First prepare the data. You may have already done this step.

df1 = pd.read_csv(r'c:\temp\test1.txt', sep=' ')

df1
Out[30]: 
   year description     field
0  1993        bar0  a01arb92
1  1993        bar1  a01svb92
2  1993        bar2  a01fam92
3  1993        bar3       a08
4  1993        bar4  a01bea93

df1 = df1.set_index(['year', 'description', 'field'])

df2 = pd.read_csv(r'c:\temp\test2.txt', sep=' ')

df2
Out[33]: 
   year description     field
0  1993        bar0  a01arb92
1  1993        bar1  a01svb92
2  1993        bar3       a08
3  1993        bar4  a01bea93

df2 = df2.set_index(['year', 'description', 'field'])

Next, create a new column in df2 and use pandas to copy that column over to the previous data frame. The index alignment fills NaN for the missing entries. Then use fillna to assign those a value of 0.

df2['exists'] = 1

df1['exists'] = df2['exists']

df1
Out[37]: 
                           exists
year description field           
1993 bar0        a01arb92       1
     bar1        a01svb92       1
     bar2        a01fam92     NaN
     bar3        a08            1
     bar4        a01bea93       1

df1.fillna(0)
Out[38]: 
                           exists
year description field           
1993 bar0        a01arb92       1
     bar1        a01svb92       1
     bar2        a01fam92       0
     bar3        a08            1
     bar4        a01bea93       1
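
Note that fillna(0) returns a new frame and leaves the column as float. To keep the result and match the integer exists column in the expected output, you could assign it back and cast:

    df1["exists"] = df1["exists"].fillna(0).astype(int)
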
  • Thanks for the answer. It seems that my question was badly formulated: `df2` does not have the same structure as `df1`; it has the `fields` listed in `df1` as columns. I updated the question, I hope that helps. – FooBar Oct 29 '14 at 14:43
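
Given that clarification, the same idea still works if the membership is read off the Stata file's columns directly (a sketch for a single year; the file name is illustrative):

    d = pd.read_stata("1993.dta")
    # flag each field in df1's index as existing when it is a column of the file
    df1["exists"] = df1.index.get_level_values("field").isin(d.columns).astype(int)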