Speed up nested for loop processing in Python

Question

I am doing data manipulation in Python. Currently, there are 231244 rows and 6750 columns in the data. My code for the manipulation is below.

for i in range(231244):
  for j in range(len(data[i][0])):
    df.at[i,data[i][0][j]] = 1

This is the data row. It is basically a pickle file

The problem I am encountering is that it takes too much time. I left it running for a whole day on Google Colab and got no results. In fact, the session restarted. Is there any way to get results in a few minutes?

Does `df` happen to be an empty dataframe? i.e. is this growing the dataframe in a loop? Note, you are doing over a billion and a half iterations, it shouldn't take a whole day. Unless... — juanpa.arrivillaga, May 27 '20 at 08:47
@juanpa.arrivillaga df is basically an empty data frame with only column names — Moiez, May 27 '20 at 09:04
@JanChristophTerasa I have attached an image link what data is — Moiez, May 27 '20 at 09:05
**That's your problem.** Growing the dataframe this way takes quadratic time, since the entire underlying buffer is copied on each iteration. In the future, you should **always** provide a [mcve]. Do not post links or images (or even worse, links to images). Even leaving that aside, your description doesn't make sense: calling `data` "basically a pickle file" is simply not correct. Pickle is a binary object serialization format. That whatever object `data` refers to was created by deserializing a pickle file is neither here nor there; it is some Python object with some specific type. — juanpa.arrivillaga, May 27 '20 at 09:09
There are *much* more efficient ways to accomplish what you are trying to accomplish, but again, *you must provide a [mcve]*. Read the [following](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) question and the answers for advice on how to create good, reproducible `pandas` examples. — juanpa.arrivillaga, May 27 '20 at 09:20
I am slowly coming to understand that this is an [XY problem](https://meta.stackexchange.com/a/66378/395122). — Jan Christoph Terasa, May 27 '20 at 14:21
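
To illustrate the point raised in the comments about growing an empty dataframe, here is a minimal sketch of allocating all rows up front, so that each assignment writes into an existing cell instead of enlarging the frame. The tiny `data` and `columns` values are hypothetical stand-ins for the real objects, and the sketch assumes that `data[i][0]` is a list of labels that all exist as columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `data`: data[i][0] is assumed to be
# a list of column labels for row i.
data = [(["a", "c"],), (["b"],), (["a", "b", "c"],)]
columns = ["a", "b", "c"]  # hypothetical column names

# Allocate every row and column once, filled with zeros, so that the
# assignments below never have to enlarge the frame.
df = pd.DataFrame(
    np.zeros((len(data), len(columns)), dtype=np.int8),
    columns=columns,
)

# Same assignment pattern as in the question, now on a preallocated frame.
for i, row in enumerate(data):
    df.loc[i, row[0]] = 1

print(df)
```

At the size given in the question (231244 rows by 6750 columns), a dense `int8` array of this kind needs roughly 1.5 GB, which is why a one-byte dtype is used here rather than the default float.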

Jan Christoph Terasa · Answer 1 · 2020-05-27T14:16:58.367

1

Assuming that data[i][0] contains a list of column labels, you can get rid of the inner loop:

for i in range(231244):
    cols = data[i][0]
    df.loc[i, cols] = 1

As an example:

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df.loc[0, ['a', 'b']] = 0
df.loc[1, ['c', 'a']] = 10
print(df)

    a  b   c
0   0  0   7
1  10  5  10
2   3  6   9
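
The outer loop can probably be avoided as well. The following is only a sketch under the same assumption, namely that `data[i][0]` is a list of column labels and that every label is one of the dataframe's columns. The small `data` and `columns` values are made up for illustration. The idea is to collect all (row, column) positions first, set them in a NumPy array with a single indexing operation, and build the dataframe once at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `data`: data[i][0] is assumed to be
# a list of column labels for row i.
data = [(["a", "c"],), (["b"],), (["a", "b", "c"],)]
columns = ["a", "b", "c"]  # hypothetical column names

# Map each column label to its integer position.
col_pos = {label: k for k, label in enumerate(columns)}

# One row index per label: row i is repeated len(data[i][0]) times.
lengths = [len(row[0]) for row in data]
row_idx = np.repeat(np.arange(len(data)), lengths)

# The matching column position for every label, in the same order.
col_idx = np.fromiter(
    (col_pos[label] for row in data for label in row[0]),
    dtype=np.intp,
    count=sum(lengths),
)

# Set all cells at once, then wrap the array in a DataFrame.
values = np.zeros((len(data), len(columns)), dtype=np.int8)
values[row_idx, col_idx] = 1
df = pd.DataFrame(values, columns=columns)

print(df)
```

The remaining Python-level work is a single pass over the labels to look up their positions; the assignment itself happens inside NumPy.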


`data[i][0]` contains a list. Each row has a different list length. – Moiez May 27 '20 at 09:12
So, did it work or help? I do not know how or if access to columns in a DataFrame can be vectorized that way. – Jan Christoph Terasa May 27 '20 at 12:53
It didn't work, as it was taking the whole list this way, not a single list value. – Moiez May 27 '20 at 13:19
@Moiez That was the point. – Jan Christoph Terasa May 27 '20 at 14:12
