I have a large file, 'file.csv', of 10 million lines (150 MB) that looks like this:
12|15|2
17|13|4
10|17|3
12|15|8
15|12|3
The first field is the matrix column, the second field is the row, and the third field is the value to be added at that (row, column) position.
I want to get a matrix similar to this (I think it's a good idea):
    10  17  12  13  15
10   0   0   0   0   0
17   3   0   0   0   0
12   0   0   0   0   3
13   0   4   0   0   0
15   0   0  10   0   0
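For reference, this matrix is just a grouped sum of the value field over (row, column) pairs. Here is a minimal pandas sketch that reproduces it from the five sample lines; note that pivot_table sorts the labels (so the order differs from above) and omits labels that never occur, which reindexing fixes, as shown at the end:

import pandas as pd
from io import StringIO

# The five sample lines from above
sample = "12|15|2\n17|13|4\n10|17|3\n12|15|8\n15|12|3\n"
df2 = pd.read_csv(StringIO(sample), sep='|', names=['col', 'row', 'value'])

# Sum the values that land on the same (row, column) cell
matrix = df2.pivot_table(index='row', columns='col', values='value',
                         aggfunc='sum', fill_value=0)
print(matrix)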
After that, I want to obtain the largest values, where they are and what value they hold; I decide how many of the largest values to see.
For example, if I decide to see the 3 largest values, my result would be (positions written as (column, row)):
value: 10 / position: (12, 15)
value: 4 / position: (17, 13)
value: 3 / position: (10, 17)
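A minimal sketch of that top-N step, assuming the matrix is a pandas DataFrame with row labels on the index and column labels on the columns as above; stack() flattens it into a Series keyed by (row, column), and nlargest then picks the biggest cells:

import pandas as pd

# A small matrix matching the example above
labels = [10, 17, 12, 13, 15]
df = pd.DataFrame(0, index=labels, columns=labels)
df.loc[15, 12] = 10
df.loc[13, 17] = 4
df.loc[17, 10] = 3
df.loc[12, 15] = 3

# nlargest(3) returns the 3 biggest cells together with their positions
for (row, col), value in df.stack().nlargest(3).items():
    print('value: {} / position: ({}, {})'.format(value, col, row))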
I have tried the code below, but it only advances about 40 thousand lines every 15 minutes, and at the university we have a 64-bit server with 64 processors and 64 GB of RAM:
import pandas as pd

# Scale factor applied to the values (the intent is to convert them to GB)
to_GB = 10

# index_and_headers.txt holds one integer label per line
with open('index_and_headers.txt') as f:
    index_and_headers = [int(member) for member in f]

# Empty matrix with those labels on both axes
df = pd.DataFrame(0.0, index=index_and_headers, columns=index_and_headers)
print(df)

# Each line of file.csv is column|row|value
df2 = pd.read_csv('file.csv', sep='|', header=None)
print(df2)

# Accumulate every input row into its (row, column) cell, one at a time
for cont in range(len(df2)):
    x = df2.iloc[cont, 1]  # row label (second field)
    y = df2.iloc[cont, 0]  # column label (first field)
    df.loc[x, y] += float(df2.iloc[cont, 2]) / to_GB

print(df)
print(df.idxmax())
print(df.max())
This is too slow, and even then I do not get the right result (idxmax and max only give per-column maxima, not the positions of the overall largest cells). Could you please help me? Any ideas on how to reduce that time? Thanks.
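For what it's worth, the slowness most likely comes from updating one cell per Python-level call (the per-row iloc lookups and the nested apply). The same accumulation can be done as a single vectorized grouped sum. A sketch of that idea, assuming the whole 150 MB file fits in memory (the column names 'col', 'row', 'value' are my own labels):

import pandas as pd

to_GB = 10

# Read all 10 million rows in one call
df2 = pd.read_csv('file.csv', sep='|', header=None,
                  names=['col', 'row', 'value'])

# Sum duplicate (row, column) pairs in one pass, then spread the
# column labels back out into a matrix; missing cells become 0
df = df2.groupby(['row', 'col'])['value'].sum().unstack(fill_value=0) / to_GB

# The top-N step from the earlier sketch applies unchanged
print(df.stack().nlargest(3))

This touches each input row once in compiled code instead of once per Python call, so it should take seconds rather than hours.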
Something else: index_and_headers.txt is just a list of numbers that serve as labels for the rows and columns of the matrix, and that I use to locate the results I want.
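Continuing the sketch above, if the final matrix must contain every label from index_and_headers.txt (even ones that never occur in file.csv) and in that file's order, reindex can add the missing all-zero rows and columns:

# One integer label per line, as described above
with open('index_and_headers.txt') as f:
    labels = [int(line) for line in f]

# Insert missing labels as zero rows/columns and fix the ordering
df = df.reindex(index=labels, columns=labels, fill_value=0)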