
I have a DataFrame like this:

Names;        Count; PartNr
R1, R2,...Rn; n;     1234-5678
C1, C2;       2;     1234-6789

The list should be exported to a CSV file for import into another proprietary piece of software. That software accepts at most 100 characters in the "Names" column; if there is more data, the row should be wrapped: duplicate the row and put the remaining names into the copy, so that a part may span multiple rows, each with a maximum of 100 characters in the Names column. The absolute count of the parts should appear only in the first of these rows, so the Count value of every continuation row should be set to zero.

Names;             Count; PartNr
R1, R2,...Ra;        n;     1234-5678
Ra+1, Ra+2,...Rb;    0;     1234-5678   
Rb+1, Rb+2,...Rn;    0;     1234-5678
C1, C2;              2;     1234-6789

Is there a nice way to do this directly in pandas?

I tried to iterate over the rows, but I'm not allowed to modify the DataFrame I'm iterating over, so this didn't work. Any better solutions?

The DataFrames are between 10 and 1000 rows long, and only a few rows have over-long Names, so performance isn't really important.

Franz Forstmayr
  • Have you tried iterating through `df_copy = df.copy()` instead of the original dataframe `df`? – Kris Jan 16 '17 at 15:40
  • You mean copying the original one, and then iterating and appending again? Sounds good, I will try it. – Franz Forstmayr Jan 16 '17 at 15:41
  • Also, if you're allowed to "wrap" the `Rn` values, is there a reason why you wouldn't just put a single value on each row? – Kris Jan 16 '17 at 15:41
  • Yeah, that would be possible too, but in my opinion the file would get very confusing. There are only a few lines which should get 'duplicated'. – Franz Forstmayr Jan 16 '17 at 15:48
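
A minimal sketch of the `df.copy()` idea from the comments, assuming `Names` holds comma-separated strings as in the question's sample; `wrapped` is a placeholder name. Iterating over a frozen copy sidesteps the modify-while-iterating problem, and the standard library's `textwrap.wrap` splits at whitespace so each piece stays within 100 characters:

import textwrap
import pandas as pd

# Sketch: iterate over a frozen copy, collect the wrapped rows in a plain
# list, and build a new DataFrame at the end. Assumes 'Names' holds strings.
df = pd.DataFrame({'Names': [', '.join('R{}'.format(i) for i in range(1, 31)),
                             'C1, C2'],
                   'Count': [30, 2],
                   'PartNr': ['1234-5678', '1234-6789']})

rows = []
for _, row in df.copy().iterrows():
    # textwrap.wrap splits at spaces into pieces of at most 100 characters
    for i, piece in enumerate(textwrap.wrap(row['Names'], 100)):
        rows.append({'Names': piece,
                     'Count': row['Count'] if i == 0 else 0,  # zero continuations
                     'PartNr': row['PartNr']})
wrapped = pd.DataFrame(rows, columns=['Names', 'Count', 'PartNr'])

Since `wrap` splits at spaces, a piece may end with a trailing comma; strip it with `piece.rstrip(',')` if the target software objects.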

1 Answer


I don't think there is a particularly nice way to do this in Pandas. Whenever a DataFrame holds a column of lists and you want to do a calculation that requires iterating over those lists, you are forced to call a Python function once for each item (i.e. each list) in the column. This hurts performance, since there is no way to apply Pandas' fast vectorized operations here. The moral of the story is to avoid putting lists into a DataFrame if possible.
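
For example, with a toy frame, even computing each list's length has to go through `apply`, one Python-level call per row:

import pandas as pd

# A column of lists forces element-wise Python calls; there is no
# vectorized fast path for list-valued columns.
df = pd.DataFrame({'Names': [['R1', 'R2', 'R3'], ['C1', 'C2']]})
print(df['Names'].apply(len))
# 0    3
# 1    2
# Name: Names, dtype: int64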

(Of course, in your case, it seems you want to prepare a CSV in a particular format for another application. So if that requires lists in a DataFrame, so be it.)

As you mentioned, you can iterate over the rows. You could use `iterrows` or `itertuples`. I'll use `itertuples` since it tends to be faster:

import pandas as pd

Rs = ['R{}'.format(i) for i in range(1,251)]
Cs = ['C1', 'C2']
df = pd.DataFrame({'Names': [Rs, Cs], 'Count': ['n', 2],
                   'PartNr': ['1234-5678', '1234-6789']})

def chunks(seq, n):
    # http://stackoverflow.com/a/312464/190597 (Ned Batchelder)
    """ Yield successive n-sized chunks from seq."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

result = []
for row in df.itertuples():
    result.append(pd.DataFrame({'Names': list(chunks(row.Names, 100)),
                                'Count': row.Count,
                                'PartNr': row.PartNr}))

result = pd.concat(result, axis=0, ignore_index=True)
print(result)

yields

  Count                                              Names     PartNr
0     n  [R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11,...  1234-5678
1     n  [R101, R102, R103, R104, R105, R106, R107, R10...  1234-5678
2     n  [R201, R202, R203, R204, R205, R206, R207, R20...  1234-5678
3     2                                           [C1, C2]  1234-6789

Inside the `itertuples` for-loop, a new DataFrame is constructed for each row. These DataFrames are collected in a list called `result`. After the for-loop is done, the list of DataFrames is concatenated into a single DataFrame.
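
Two details from the question are still open: `chunks` splits by number of names rather than by the 100-character limit, and the Count should be zeroed on every continuation row. A minimal sketch of a character-aware variant that also joins each chunk back into a string and writes the semicolon-separated file (`char_chunks` and `out.csv` are placeholder names, building on the `df` defined above):

def char_chunks(names, limit=100):
    """Yield lists of names whose ', '-joined length stays within limit."""
    chunk = []
    for name in names:
        if chunk and len(', '.join(chunk + [name])) > limit:
            yield chunk
            chunk = []
        chunk.append(name)
    if chunk:
        yield chunk

rows = []
for row in df.itertuples():
    for i, chunk in enumerate(char_chunks(row.Names)):
        rows.append({'Names': ', '.join(chunk),
                     'Count': row.Count if i == 0 else 0,  # zero continuation rows
                     'PartNr': row.PartNr})
result = pd.DataFrame(rows, columns=['Names', 'Count', 'PartNr'])
result.to_csv('out.csv', sep=';', index=False)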

unutbu