
I need to read data from a Postgres server and put it into an array / data frame. Each row has a source and a destination field. I need to add these into an array cumulatively: as I iterate through the data frame, if the source and destination fields are not in the accounts column, I need to add them to it.

Here is what my code currently looks like (excluding the Postgres parts for brevity):


import numpy as np
import pandas as pd

# Load the data
data = pd.read_sql(sql_command, conn)

# Take a subset of the data until the algorithm is perfected.
np.random.seed(42)

n = data.shape[0]
ix = np.random.choice(n, 10000)
df_tmp = data.iloc[ix]

# Combine source and destination into a list in another column
df_tmp['accounts'] = df_tmp.apply(lambda x: [x['source'], x['destination']], axis=1)

# Attempt at cumulatively adding accounts to columns
for index, row in df_tmp.iterrows():
    if 'accounts' not in df_tmp:
        df_tmp['accounts'] = df_tmp.apply(lambda x: [x['accounts'], x['source'], x['destination']], axis=1)
    else:
        df_tmp['accounts'] = df_tmp['accounts']

Here is what my data should look like: [image: What data should look like]

Questions:

  1. Is this the right way to do this?
  2. The final row will have about 1 million accounts, which would make this very expensive. Is there a more efficient way to represent this?
0xsegfault
    Can you add a small sample from your data and show what your expected outcome should look like? – stahamtan Sep 03 '19 at 14:11
  • Why do you want to place data into a list (you call it "array") which is already present in the row in its own columns? – Michael Butscher Sep 03 '19 at 14:11
  • Thanks for helping @SIA / @Michael Butscher. I have added pictures that should help clarify. Please let me know if you need anything else. In essence, the last row is a cumulative of all the accounts in the source and destination. The block number represents epochs, so there is a time dependency. Let me know if it makes sense – 0xsegfault Sep 03 '19 at 14:30
  • So your `Accounts` column is supposed to carry the current row's (Source, Destination) pair concatenated with previous row's 'Accounts' value, is that correct? – stahamtan Sep 03 '19 at 14:39
  • Precisely, but only if they don't already exist within it. I would like to preserve them as a list/array, as I have more computations to carry out on this and don't want to have to parse strings – 0xsegfault Sep 03 '19 at 14:40

1 Answer


You could use cumsum on the accounts column to create a cumulative concatenation of the accounts values, then convert each accumulated list to a set in order to keep only the unique values.

There is a similar question answered here: Cumulative Set in PANDAS

df_tmp['accounts_acc'] = df_tmp['accounts'].cumsum().apply(set)
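A minimal sketch of this approach on toy data (the column names `source`/`destination` come from the question; the data itself is invented for illustration):

```python
import pandas as pd

# Toy stand-in for df_tmp from the question
df_tmp = pd.DataFrame({
    'source':      ['a', 'b', 'a'],
    'destination': ['b', 'c', 'c'],
})

# Build the per-row [source, destination] pair, as in the question...
df_tmp['accounts'] = df_tmp.apply(lambda x: [x['source'], x['destination']], axis=1)

# ...then cumsum concatenates the lists row by row, and set deduplicates
df_tmp['accounts_acc'] = df_tmp['accounts'].cumsum().apply(set)

print(df_tmp['accounts_acc'].tolist())
# [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'b', 'c'}]
```

This works because `+` concatenates Python lists, so `cumsum` on an object column of lists produces a growing concatenation, and `set` then drops duplicates per row.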
stahamtan
  • I get the following error: `TypeError: unhashable type: 'list'` any ideas? – 0xsegfault Sep 03 '19 at 15:36
  • Looks like you are trying to convert a list of lists to a set. Run the above line right after you create the `accounts` column, i.e. after this line: `df_tmp['accounts'] = df_tmp.apply(lambda x: [x['source'], x['destination']], axis=1)` – stahamtan Sep 03 '19 at 15:42
  • Managed to solve it by transforming the lists into tuples instead. Thanks a lot! – 0xsegfault Sep 03 '19 at 15:56
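A short sketch of the error discussed in these comments and the tuple workaround (toy data; the nested-list structure mimics what the question's loop produces):

```python
import pandas as pd

# If the cells contain *nested* lists, set() fails because lists
# are unhashable:
nested = pd.Series([[['a', 'b'], 'c']])
try:
    nested.cumsum().apply(set)
except TypeError as e:
    print(e)  # unhashable type: 'list'

# Storing the inner pairs as tuples makes every element hashable again:
nested_tuples = pd.Series([[('a', 'b'), 'c']])
print(nested_tuples.cumsum().apply(set).tolist())
# [{('a', 'b'), 'c'}]
```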