
I have a fairly large dataset, and I would like to count the values equal to 1 in every row.

input file:

1526  0 1 2 1 0
782   0 1 1 1 2
7653  1 1 1 0 0
87bt  1 0 1 2 2

desired output file:

1526 2
782 3
7653 3
87bt 2

my code:

df = pd.read_csv('data1', delimiter=' ')
df_sub = df.iloc[:, 1:]
sum1 = 0
for het in df_sub:
    if het == 1:
        sum1 = sum1 + 1
print(sum1)

 
Erfan
El.h
  • You don't have column names? And can you explain your logic? – Erfan Jul 28 '20 at 17:41
  • Yes, no column names. I would like to print or write to a file the first-column values and their corresponding counts of "1". – El.h Jul 28 '20 at 17:51
  • avoid python looping with large datasets ... https://stackoverflow.com/a/55557758/6692898 – RichieV Jul 28 '20 at 17:53
  • @Ch3steR honestly, I don't like the implicit use of `axis=1` with just `sum(1)`; the syntax is confusing, especially in your example. In the first part it means literally the integer 1, while in the second part it's the axis argument. – Erfan Jul 28 '20 at 17:55

1 Answer

You can use df.eq with df.sum here. I suggest using the index_col parameter of pd.read_csv to set the index while reading the file.

from io import StringIO
import pandas as pd

text = '''1526  0 1 2 1 0
782   0 1 1 1 2
7653  1 1 1 0 0
87bt  1 0 1 2 2'''

# sep=r'\s+' splits on runs of whitespace; `index_col=0` sets the 1st column as the index
df = pd.read_csv(StringIO(text), sep=r'\s+', header=None, index_col=0)
df.eq(1).sum(axis=1)
0
1526    2
782     3
7653    3
87bt    2
dtype: int64
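Since the question also asks about writing the result to a file, here is a minimal sketch of that step. It assumes whitespace-separated input and uses `Series.to_csv`. The file name `output.txt` in the comment is only a placeholder, not something from the question.

```python
from io import StringIO
import pandas as pd

text = """1526  0 1 2 1 0
782   0 1 1 1 2
7653  1 1 1 0 0
87bt  1 0 1 2 2"""

# Assumption: fields are separated by one or more whitespace characters.
df = pd.read_csv(StringIO(text), sep=r"\s+", header=None, index_col=0)
counts = df.eq(1).sum(axis=1)

# With no path, to_csv returns the text; header=False drops the header row,
# and the index (the original first column) is written by default.
out = counts.to_csv(sep=" ", header=False)
print(out)

# To write to disk instead (placeholder file name):
# counts.to_csv("output.txt", sep=" ", header=False)
```

Each output line holds the row label followed by its count, which matches the desired output shown in the question.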

If performance is an issue, you can use np.count_nonzero, which is significantly faster than df.eq(...).sum(...); see the timeit results linked below.

np.count_nonzero(df.to_numpy()==1, axis=1)
# array([2, 3, 3, 2], dtype=int64)
# pd.Series(np.count_nonzero(df.to_numpy()==1, axis=1), index=df.index)
# This is almost 3X faster than `df.eq(...).sum(...)`
# For more details refer to https://stackoverflow.com/a/63103435/12416453
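If you want to check the equivalence and the speed difference on your own machine, a rough sketch follows. The random data, its size, and the repeat count are arbitrary assumptions for illustration, and the measured ratio will vary with hardware and library versions.

```python
from timeit import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 rows of values in {0, 1, 2}.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 3, size=(100_000, 5)))

via_pandas = df.eq(1).sum(axis=1)
via_numpy = pd.Series(np.count_nonzero(df.to_numpy() == 1, axis=1), index=df.index)

# Both approaches should give identical per-row counts.
same = np.array_equal(via_pandas.to_numpy(), via_numpy.to_numpy())

t_pandas = timeit(lambda: df.eq(1).sum(axis=1), number=20)
t_numpy = timeit(lambda: np.count_nonzero(df.to_numpy() == 1, axis=1), number=20)
print(f"same result: {same}, pandas: {t_pandas:.3f}s, numpy: {t_numpy:.3f}s")
```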

axis=1 means the sum runs across the columns, giving one value per row. pandas also accepts the axis name:

df.eq(1).sum(axis='columns')
Ch3steR
  • `df = pd.read_csv("data1", header=None, index_col=0, delimiter=' ')` followed by `a = pd.Series(np.count_nonzero(df.to_numpy()==1, axis=1), index=df.index)` prints just 0s for all rows. – El.h Jul 30 '20 at 17:12
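One possible cause of an all-zero result, offered only as a guess since the actual file isn't shown: if the separator passed to `pd.read_csv` never matches the file, every line is parsed as a single field, `index_col=0` turns that field into the index, and no data columns are left to count. The sketch below reproduces this with the sample data by deliberately using the default comma separator, then compares it with `sep=r"\s+"`.

```python
from io import StringIO

import numpy as np
import pandas as pd

text = """1526  0 1 2 1 0
782   0 1 1 1 2
7653  1 1 1 0 0
87bt  1 0 1 2 2"""

# Separator that never matches (default ','): each line becomes one field,
# which index_col=0 consumes as the index, so the frame has zero columns.
bad = pd.read_csv(StringIO(text), header=None, index_col=0)
bad_counts = np.count_nonzero(bad.to_numpy() == 1, axis=1)

# Splitting on runs of whitespace yields the five data columns.
good = pd.read_csv(StringIO(text), sep=r"\s+", header=None, index_col=0)
good_counts = np.count_nonzero(good.to_numpy() == 1, axis=1)

print(bad.shape, bad_counts)
print(good.shape, good_counts)
```

Checking `df.shape` right after reading is a quick way to confirm the file was split into the expected number of columns.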