I am doing data manipulation in Python. The data currently has 231244 rows and 6750 columns. My code for the manipulation is below.

for i in range(231244):
  for j in range(len(data[i][0])):
    df.at[i,data[i][0][j]] = 1

This is the data row. It is basically a pickle file

The problem is that it takes far too long. I left it running for a whole day on Google Colab and got no results; in fact, the session restarted. Is there a way to get results in a few minutes?

Moiez
  • Does `df` happen to be an empty dataframe? I.e., is this growing the dataframe in a loop? Note that you are doing over a billion and a half iterations, but it shouldn't take a whole day. Unless... – juanpa.arrivillaga May 27 '20 at 08:47
  • @juanpa.arrivillaga df is basically an empty data frame with only column names – Moiez May 27 '20 at 09:04
  • @JanChristophTerasa I have attached an image link showing what the data is – Moiez May 27 '20 at 09:05
  • **That's your problem.** Growing the dataframe this way takes quadratic time, since the entire underlying buffer is copied on each iteration. In the future, you should **always** provide a [mcve]. Do not post links or images (or even worse, links to images). Even leaving that aside, your description doesn't make sense: calling `data` "basically a pickle file" is simply not correct. Pickle is a binary object serialization format; that whatever object `data` refers to was created by deserializing a pickle file is neither here nor there. It is some Python object with some specific type. – juanpa.arrivillaga May 27 '20 at 09:09
  • There are *much* more efficient ways to accomplish what you are trying to accomplish, but again, *you must provide a [mcve]*. Read the [following](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) question and its answers for advice on how to create good, reproducible `pandas` examples – juanpa.arrivillaga May 27 '20 at 09:20
  • I am slowly coming to understand that this is an [XY problem](https://meta.stackexchange.com/a/66378/395122). – Jan Christoph Terasa May 27 '20 at 14:21

1 Answer

Assuming that data[i][0] contains a list of column labels, you can get rid of the inner loop:

for i in range(231244):
    cols = data[i][0]
    df.loc[i, cols] = 1

As an example:

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df.loc[0, ['a', 'b']] = 0
df.loc[1, ['c', 'a']] = 10
print(df)

    a  b   c
0   0  0   7
1  10  5  10
2   3  6   9
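
If `df` starts out empty, as mentioned in the comments, even the single loop above still enlarges the frame row by row. A possible way around that is to skip per-row assignment entirely: collect the (row, column) positions first, fill a preallocated NumPy array in one step, and build the DataFrame once at the end. The sketch below is only an illustration under assumptions: the small `data` and `columns` values are hypothetical stand-ins, and it assumes every label in `data[i][0]` appears in the column list.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: data[i][0] is a list of column labels.
data = [(["a", "c"],), (["b"],), ([],), (["a", "b", "c"],)]
columns = ["a", "b", "c"]  # assumed to be the known column names

# Map each column label to its integer position.
col_pos = {label: k for k, label in enumerate(columns)}

# Collect the coordinates of every cell that should be set to 1.
rows = [i for i, row in enumerate(data) for _ in row[0]]
cols = [col_pos[label] for row in data for label in row[0]]

# Fill a preallocated array with one vectorized assignment,
# then wrap it in a DataFrame a single time.
values = np.zeros((len(data), len(columns)), dtype=np.uint8)
values[rows, cols] = 1
df = pd.DataFrame(values, columns=columns)
print(df)
```

With `uint8`, a dense 231244 x 6750 array needs roughly 1.5 GB, so whether this fits depends on the memory available in the session.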
Jan Christoph Terasa