1

Attempting to read a data frame that has values in a random rows/columns order and I would like to get a new column where the all the values containing 'that' are summarized.

Input:
    0       1       2       3       4
0   this=1  that=2  who=2   was=3   where=5
1   that=4  who=5   this=1  was=3   where=2
2   was=2   who=7   this=7  that=3  where=7
3   was=3   who=4   this=7  that=1  where=8
4   that=1  who=3   this=4  was=1   where=3


Output:
    0
0   that=2
1   that=4
2   that=3
3   that=1
4   that=1

I have been successfully able to get the correct result with the following code. But with larger data frames it takes a long time to complete

df1=pd.DataFrame([['this=1', 'that=2', 'who=2', 'was=3', 'where=5'],
                 ['that=4', 'who=5', 'this=1', 'was=3', 'where=2'],
                 ['was=2', 'who=7', 'this=7', 'that=3','where=7'],
                 ['was=3', 'who=4', 'this=7', 'that=1', 'where=8'],
                 ['that=1', 'who=3', 'this=4', 'was=1', 'where=3']], 
                columns=[0,1,2,3,4])

df2=pd.DataFrame()
for i in df1.index:
    data=[name for name in df1[i] if name[0:4]=='that']
    df2=df2.append(pd.DataFrame(data))
pacannon
  • 13
  • 3
  • shouldn't `that=2` exist? column `1` and index `0`? – MattR Jul 25 '19 at 19:41
  • [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Jul 25 '19 at 21:27

1 Answers1

1
df1[df1.apply(lambda x: x.str.contains('that'))].stack()

Let's break this down:

df1.apply(lambda x: x.str.contains('that')) Applies our lambda function to the entire dataframe. in english it reads: if that is in our value, True

       0      1      2      3      4
0  False   True  False  False  False
1   True  False  False  False  False
2  False  False  False   True  False
3  False  False  False   True  False
4   True  False  False  False  False

df1[] around that will return the values, instead of True/False:

        0       1    2       3    4
0     NaN  that=2  NaN     NaN  NaN
1  that=4     NaN  NaN     NaN  NaN
2     NaN     NaN  NaN  that=3  NaN
3     NaN     NaN  NaN  that=1  NaN
4  that=1     NaN  NaN     NaN  NaN

stack() will stack all the values into one Series. stack() gets rid of NA by default, which is what you needed.

if the extra index is tripping you up, you can also reset the index for a single series:

df1[df1.apply(lambda x: x.str.contains('that'))].stack().reset_index(drop=True)

0    that=2
1    that=4
2    that=3
3    that=1
4    that=1
dtype: object
MattR
  • 4,887
  • 9
  • 40
  • 67