I have a large file, 'file.csv', of 10 million lines (150 MB) that looks like this:
12|15|2
17|13|4
10|17|3
12|15|8
15|12|3
The first field is the matrix column, the second field is the row, and the third field is the value to be added at that (row, column) position.
I want to get a matrix similar to this (I think it's a good idea):
    10  17  12  13  15
10   0   0   0   0   0
17   3   0   0   0   0
12   0   0   0   0   3
13   0   4   0   0   0
15   0   0  10   0   0
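For reference, this matrix is just a grouped sum of the value field over (row, column) pairs. Here is a minimal pandas sketch that reproduces it from the five sample lines; note that pivot_table sorts the labels (so the order differs from above) and omits labels that never occur, which reindexing fixes, as shown at the end:

import pandas as pd
from io import StringIO

# The five sample lines from above
sample = "12|15|2\n17|13|4\n10|17|3\n12|15|8\n15|12|3\n"
df2 = pd.read_csv(StringIO(sample), sep='|', names=['col', 'row', 'value'])

# Sum the values that land on the same (row, column) cell
matrix = df2.pivot_table(index='row', columns='col', values='value',
                         aggfunc='sum', fill_value=0)
print(matrix)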
After that, I want to obtain the largest values, where they are and what value they hold; I decide how many of the largest values to see.
For example, if I decide to see the 3 largest values, my result would be (positions written as (column, row)):
value: 10 / position: (12, 15)
value: 4 / position: (17, 13)
value: 3 / position: (10, 17)
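A minimal sketch of that top-N step, assuming the matrix is a pandas DataFrame with row labels on the index and column labels on the columns as above; stack() flattens it into a Series keyed by (row, column), and nlargest then picks the biggest cells:

import pandas as pd

# A small matrix matching the example above
labels = [10, 17, 12, 13, 15]
df = pd.DataFrame(0, index=labels, columns=labels)
df.loc[15, 12] = 10
df.loc[13, 17] = 4
df.loc[17, 10] = 3
df.loc[12, 15] = 3

# nlargest(3) returns the 3 biggest cells together with their positions
for (row, col), value in df.stack().nlargest(3).items():
    print('value: {} / position: ({}, {})'.format(value, col, row))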
I have tried the code below, but it only advances about 40 thousand lines every 15 minutes, and at the university we have a 64-bit server with 64 processors and 64 GB of RAM:
import pandas as pd

# Scale factor applied to the values (the intent is to convert them to GB)
to_GB = 10

# index_and_headers.txt holds one integer label per line
with open('index_and_headers.txt') as f:
    index_and_headers = [int(member) for member in f]

# Empty matrix with those labels on both axes
df = pd.DataFrame(0.0, index=index_and_headers, columns=index_and_headers)
print(df)

# Each line of file.csv is column|row|value
df2 = pd.read_csv('file.csv', sep='|', header=None)
print(df2)

# Accumulate every input row into its (row, column) cell, one at a time
for cont in range(len(df2)):
    x = df2.iloc[cont, 1]  # row label (second field)
    y = df2.iloc[cont, 0]  # column label (first field)
    df.loc[x, y] += float(df2.iloc[cont, 2]) / to_GB

print(df)
print(df.idxmax())
print(df.max())
This is too slow, and even then I do not get the right result (idxmax and max only give per-column maxima, not the positions of the overall largest cells). Could you please help me? Any ideas on how to reduce that time? Thanks.
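For what it's worth, the slowness most likely comes from updating one cell per Python-level call (the per-row iloc lookups and the nested apply). The same accumulation can be done as a single vectorized grouped sum. A sketch of that idea, assuming the whole 150 MB file fits in memory (the column names 'col', 'row', 'value' are my own labels):

import pandas as pd

to_GB = 10

# Read all 10 million rows in one call
df2 = pd.read_csv('file.csv', sep='|', header=None,
                  names=['col', 'row', 'value'])

# Sum duplicate (row, column) pairs in one pass, then spread the
# column labels back out into a matrix; missing cells become 0
df = df2.groupby(['row', 'col'])['value'].sum().unstack(fill_value=0) / to_GB

# The top-N step from the earlier sketch applies unchanged
print(df.stack().nlargest(3))

This touches each input row once in compiled code instead of once per Python call, so it should take seconds rather than hours.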
Something else: index_and_headers.txt is just a list of numbers that serve as labels for the rows and columns of the matrix, and that I use to locate the results I want.
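Continuing the sketch above, if the final matrix must contain every label from index_and_headers.txt (even ones that never occur in file.csv) and in that file's order, reindex can add the missing all-zero rows and columns:

# One integer label per line, as described above
with open('index_and_headers.txt') as f:
    labels = [int(line) for line in f]

# Insert missing labels as zero rows/columns and fix the ordering
df = df.reindex(index=labels, columns=labels, fill_value=0)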