
I want to sum all the values from the 3rd column onward and write the result to a new csv file along with the first and second columns, using pandas, which I think is more efficient.

The values that can be added together are between 0 and 2.

Any value or character other than 0.5, 1, or 2 is ignored in the addition.

Sample of the csv file:

https://pastebin.com/WwDWqU3U

encounterId|chartTime|11885|67187|6711|6711|6710|1356|1357|1358|1359|1360|1361|1362|1366|140|140

325|2014-01-01 00:00:00|0
325|2014-01-01 01:00:00|0|0|0
325|2014-01-01 02:00:00|0
325|2014-01-01 03:00:00|0|0|0
325|2014-01-01 04:00:00|0
325|2014-01-01 05:00:00|1
325|2014-01-01 06:00:00|0|0|0
325|2014-01-01 07:00:00|1|0|0.5|1
325|2014-01-01 08:00:00|0
325|2014-01-01 09:00:00|1|0|0
325|2014-01-01 10:00:00|0
325|2014-01-01 11:00:00|1|0|0
325|2014-01-01 12:00:00|0
325|2014-01-01 13:00:00|0|0|0.5|1
325|2014-01-01 14:00:00|0
325|2014-01-01 15:00:00|0

What I'm looking for:

323|2013-06-03 00:00:00|0
323|2013-06-03 01:00:00|1
323|2013-06-03 02:00:00|1.5
323|2013-06-03 03:00:00|1.5
323|2013-06-03 04:00:00|0
323|2013-06-03 05:00:00|0.5
323|2013-06-03 06:00:00|0
323|2013-06-03 07:00:00|3.5
323|2013-06-03 08:00:00|0.5

I've tried without pandas, but it gives me some weird results.

Zenix

4 Answers


You can just call `sum` and set the parameter `axis=1`, as suggested in a previous answer here.
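For reference, a minimal sketch of what that looks like (the frame and column names here are placeholders, not the OP's real data, and this does not apply the OP's 0-to-2 filter):

```python
import pandas as pd

# Hypothetical two-row frame mirroring the OP's layout
df = pd.DataFrame({
    'ID': [323, 323],
    'date': ['2013-06-03 00:00:00', '2013-06-03 01:00:00'],
    'A': [0.0, 1.0],
    'B': [0.0, 0.5],
})

# axis=1 sums across the value columns within each row
df['total'] = df[['A', 'B']].sum(axis=1)
print(df['total'].tolist())  # [0.0, 1.5]
```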

Rahul Gurnani
  • Just flag as a duplicate rather than post this as an answer. However, this doesn't address the OP's requirement of only summing values between `0..2`. – AChampion Aug 28 '18 at 17:38

Use this:

import pandas as pd
from io import StringIO
csvfile = StringIO("""323|2013-06-03 00:00:00|0|0|0
323|2013-06-03 01:00:00|1|
323|2013-06-03 02:00:00|1|0|0.5|86
323|2013-06-03 03:00:00|1|0|0.5|0
323|2013-06-03 04:00:00|0
323|2013-06-03 05:00:00|0|0|0.5|0
323|2013-06-03 06:00:00|0
323|2013-06-03 07:00:00|1|0|0.5|2
323|2013-06-03 08:00:00|0|0.5""")

df = pd.read_csv(csvfile, sep='|', names=['ID','date','A','B','C','D'])

df_out = df.set_index(['ID','date'])

df_out.where((df_out>0) & (df_out<=2), 0)\
      .sum(1)\
      .reset_index()\
      .to_csv('outfile.csv', index=False, header=False)

!type outfile.csv  # IPython shell escape (Windows); use !cat on Linux/macOS

Output:

323,2013-06-03 00:00:00,0.0
323,2013-06-03 01:00:00,1.0
323,2013-06-03 02:00:00,1.5
323,2013-06-03 03:00:00,1.5
323,2013-06-03 04:00:00,0.0
323,2013-06-03 05:00:00,0.5
323,2013-06-03 06:00:00,0.0
323,2013-06-03 07:00:00,3.5
323,2013-06-03 08:00:00,0.5
Scott Boston
  • same problem as @rahlf23 `return _read(filepath_or_buffer, kwds) sys:1: DtypeWarning: Columns (0,6,7) have mixed types. Specify dtype option on import or set low_memory=False.` Maybe it's because I have two columns with the same name – Zenix Aug 28 '18 at 19:00
  • I added a link to an exact copy of the csv file – Zenix Aug 28 '18 at 19:27
  • @Zenix Worked fine with your input file. – Scott Boston Aug 28 '18 at 20:08
  • I get this error : `F:\Téléchargement\score\venv\lib\site-packages\pandas\io\parsers.py:678: UserWarning: Duplicate names specified. This will raise an error in the future. return _read(filepath_or_buffer, kwds)` – Zenix Aug 29 '18 at 15:26
  • Okay, I am not getting this error because I was using the names parameter in pd.read_csv https://github.com/pandas-dev/pandas/issues/10496 https://github.com/pandas-dev/pandas/issues/9424 – Scott Boston Aug 29 '18 at 15:56

Note that pd.read_csv() will raise an error when reading a csv with a variable number of columns, unless you supply column names in advance. This should do it:

import pandas as pd
import numpy as np

df = pd.read_csv('sample.txt', names=['Index','Date','Val1','Val2','Val3','Val4'], sep='|')

df[df[['Val1','Val2','Val3','Val4']]>2] = np.nan

df['Final'] = df.iloc[:,2:].sum(axis=1)

df = df[['Index','Date','Final']]

Gives:

   Index                 Date  Final
0    323  2013-06-03 00:00:00    0.0
1    323  2013-06-03 01:00:00    1.0
2    323  2013-06-03 02:00:00    1.5
3    323  2013-06-03 03:00:00    1.5
4    323  2013-06-03 04:00:00    0.0
5    323  2013-06-03 05:00:00    0.5
6    323  2013-06-03 06:00:00    0.0
7    323  2013-06-03 07:00:00    3.5
8    323  2013-06-03 08:00:00    0.5

Here is a more succinct approach (it's very similar to @Scott Boston's answer above, but avoids creating a separate dataframe). Setting the first two columns of your csv as the index of your dataframe allows you to conditionally filter the rest of the dataframe, which contains only float values:

df = pd.read_csv('sample.txt', names=['Index','Date','Val1','Val2','Val3','Val4'], sep='|').set_index(['Index','Date'])

df['Final'] = df[(df>0) & (df<=2)].sum(axis=1)

df.reset_index()[['Index','Date','Final']].to_csv('output.csv', index=False, header=False)

Gives:

323,2013-06-03 00:00:00,0.0
323,2013-06-03 01:00:00,1.0
323,2013-06-03 02:00:00,1.5
323,2013-06-03 03:00:00,1.5
323,2013-06-03 04:00:00,0.0
323,2013-06-03 05:00:00,0.5
323,2013-06-03 06:00:00,0.0
323,2013-06-03 07:00:00,3.5
323,2013-06-03 08:00:00,0.5
rahlf23
  • it returns this error: `return _read(filepath_or_buffer, kwds) sys:1: DtypeWarning: Columns (0,6,7) have mixed types. Specify dtype option on import or set low_memory=False. KeyError: ('11885', '67187', '6711', '6711', '6710', '1356', '1357', '1358', '1359', '1360', '1361', '1362', '1366', '140', '140')` – Zenix Aug 28 '18 at 18:51
  • I added a link to an exact copy of the csv file – Zenix Aug 28 '18 at 19:26
  • Why is it different from your sample csv that you posted in your question? – rahlf23 Aug 28 '18 at 19:42
  • I thought that with the file it allows you to reproduce the error I have. – Zenix Aug 29 '18 at 14:31
  • I'm asking why your sample csv in your OP is different from the link you added to your question. Both @Scott Boston and my answer will work for your sample data that you originally included. – rahlf23 Aug 29 '18 at 14:37
  • I just removed the header, normally that's no problem, no? I edited the post to match the file – Zenix Aug 29 '18 at 14:42
  • That is a totally different file, those lines have a completely different pattern..... – rahlf23 Aug 29 '18 at 14:44
  • Sorry, I didn't have this vision of things, if you can help I thank you otherwise I would repost it correctly – Zenix Aug 29 '18 at 14:48
  • Now your expected output does not correspond to your sample data. In any case, look into the `pd.read_csv()` module, you will want to use the `skiprows` keyword. – rahlf23 Aug 29 '18 at 14:51
  • How is skipping rows going to help me achieve what I want? – Zenix Aug 29 '18 at 15:22
  • It doesn't look like you used `header=False` because you are attempting to set your column names using the first line in your csv. Please read through the `pd.read_csv()` module, the documentation is quite thorough. – rahlf23 Aug 29 '18 at 15:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/179045/discussion-between-zenix-and-rahlf23). – Zenix Aug 29 '18 at 17:01

How about this?

# sum every value column (3rd onward) into the 3rd column, row by row
for i, (_, row) in enumerate(df.iterrows()):
    df.iat[i, 2] = row.iloc[2:].sum()
Matthew