
I am working on a large table in Python (using the pandas library).

I would like to perform various kinds of vector operations, such as correlation, on each row of the table.

It might be a simple problem, but I find it difficult to work with the DataFrame structure. I do not have a good idea of how to convert each row (or column) into a list (or numpy array).
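As a side note for readers: pulling a single row or column out of a DataFrame as a numpy array or list is a one-liner. Below is a minimal sketch (not part of the original question), assuming a toy frame shaped like the Data.csv sample further down:

import numpy as np
import pandas as pd

# toy frame shaped like Data.csv below, with Label used as the index
df = pd.DataFrame({'Col1': [1, 3, 5], 'Col2': [2, 4, 6]},
                  index=['Row1', 'Row2', 'Row3'])

row_as_array = df.loc['Row1'].to_numpy()   # array([1, 2])
col_as_list = df['Col1'].tolist()          # [1, 3, 5]

# correlation between two rows, as mentioned above
corr = np.corrcoef(df.loc['Row1'], df.loc['Row2'])[0, 1]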

Even counting the number of rows does not seem to be simple, because functions like df.count() ignore null data.

A simple data table and the expected result are shown below. In this case, I would like to calculate the sum of each pair of rows.

The real table is much bigger (more than 1000 rows and columns) and contains some null values.


Data.csv:

Label Col1 Col2
Row1 1 2
Row2 3 4
Row3 5 6

Output.csv:

Label Col3
Row1,Row2 4,6
Row1,Row3 6,8
Row2,Row3 8,10
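To make the expected output concrete, here is a small sketch (not from the original question) that reads the sample data and checks one pair: Row1 + Row2 gives Col1 = 1 + 3 = 4 and Col2 = 2 + 4 = 6.

import pandas as pd
from io import StringIO

csv = """Label Col1 Col2
Row1 1 2
Row2 3 4
Row3 5 6
"""

# whitespace-separated sample, Label used as the row index
df = pd.read_csv(StringIO(csv), sep=r'\s+', index_col='Label')

pair_sum = df.loc['Row1'] + df.loc['Row2']   # Col1 -> 4, Col2 -> 6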
ToBeSpecific
  • What do you mean by `null values`? Is it an empty or `NaN` value, or is the value just equal to zero? What do you want the output to be like if there is such a `null value`? – albert Nov 25 '15 at 09:05
  • @albert Sorry for my poor explanation. You may think of it as a NaN value. As my real dataset is converted from an image (to float values), there are some such values. But in this case it does not matter, because I remove them as a preprocessing step. – ToBeSpecific Nov 25 '15 at 09:11
  • You could get the number of rows with the `shape` attribute: `df.shape[0]` will be the number of rows. – Anton Protopopov Nov 25 '15 at 09:30
  • @Anton Protopopov Thank you for your advice. I confirmed that shape[0] returns the number of rows (excluding the header), including rows with null values. Also, shape[1] returns the number of columns, including the label column and columns with null values (a small sketch follows below). – ToBeSpecific Nov 25 '15 at 09:38
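A quick illustration of the point made in these comments (a sketch, not from the original thread): df.count() reports non-null values per column, while df.shape[0] or len(df) give the total row count regardless of NaNs.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, np.nan, 5], 'Col2': [2, 4, 6]})

df.count()     # per-column non-null counts: Col1 -> 2, Col2 -> 3
df.shape[0]    # 3, total number of rows, NaNs included
len(df)        # 3, equivalent row count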

2 Answers


This is only part of the solution, because you'll end up with duplicated pairs under slightly different names, so you can't simply apply the drop_duplicates method of DataFrames:

import pandas as pd
from io import StringIO

data = """
Label Col1 Col2
Row1 1 2
Row2 3 4
Row3 5 6
"""

# read the whitespace-separated sample into a DataFrame
df = pd.read_csv(StringIO(data), sep=r'\s+')

df1 = pd.DataFrame()

# add each row to every other row: labels concatenate, numeric columns sum
for row in range(df.shape[0]):
    df1 = pd.concat([df1, df.iloc[row, :] + df[df['Label'] != df.Label[row]]])

df1.reset_index(drop=True, inplace=True)

In [103]: df1
Out[103]:
      Label Col1 Col2
0  Row1Row2    4    6
1  Row1Row3    6    8
2  Row2Row1    4    6
3  Row2Row3    8   10
4  Row3Row1    6    8
5  Row3Row2    8   10
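One possible way to avoid the duplicated pairs noted above (a sketch, not part of the original answer) is to add each row only to rows whose label sorts after it, so every pair is produced exactly once; this relies on the labels being lexicographically ordered, as in the sample data:

df2 = pd.DataFrame()

for row in range(df.shape[0]):
    later = df[df['Label'] > df.Label[row]]          # rows whose label sorts after the current one
    df2 = pd.concat([df2, df.iloc[row, :] + later])

df2.reset_index(drop=True, inplace=True)
# df2: Row1Row2 / Row1Row3 / Row2Row3 with Col1, Col2 sums 4/6, 6/8, 8/10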
Anton Protopopov
  • I thought that I had to convert the DataFrame into lists, but it seems this can be done directly with the DataFrame. Although some more work is needed, thank you very much for your answer. – ToBeSpecific Nov 25 '15 at 10:36
  • @ToBeSpecific Please show your code when you finish it. – Anton Protopopov Nov 25 '15 at 11:31
  • At present, I could only do this by making lists from the DataFrame and running a double for-loop over the rows, which feels like a brute-force method. I would like to find a solution using the DataFrame itself. – ToBeSpecific Nov 25 '15 at 13:06

Pandas is a lot faster and more natural when working with columns. Thus, I would propose transposing the DataFrame first, and then just summing the columns.

Link: Invert index and columns in a pandas DataFrame
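A minimal sketch of that idea (mine, not part of the original answer, using itertools.combinations to enumerate the row pairs):

import pandas as pd
from itertools import combinations

df = pd.DataFrame({'Col1': [1, 3, 5], 'Col2': [2, 4, 6]},
                  index=['Row1', 'Row2', 'Row3'])

dft = df.T   # rows become columns: Row1, Row2, Row3

# pairwise column sums on the transposed frame
pairs = {f'{a},{b}': dft[a] + dft[b] for a, b in combinations(dft.columns, 2)}
out = pd.DataFrame(pairs).T   # rows: Row1,Row2 / Row1,Row3 / Row2,Row3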

Severin Pappadeux
  • Thank you for your advice. I knew that transposing the table is easily done with the `.T` attribute, but I did not know that working with columns is much better. I'd better find a solution using the DataFrame. – ToBeSpecific Nov 25 '15 at 15:22
  • @ToBeSpecific Yes, a pandas DataFrame is basically a collection/list of columns of different types, as opposed to a relational DB, which is a collection of rows. Each column is fast to operate on for typical stat functions like sum(), mean(), variance(), etc. Another piece of advice: for computational purposes, coerce columns into a native numeric type (like float64 or int), as in the sketch after this list. Check http://stackoverflow.com/questions/18434208/pandas-converting-to-numeric-creating-nans-when-necessary – Severin Pappadeux Nov 25 '15 at 22:23
  • Though my goal is to use somewhat more complex functions, such as the Euclidean distance between vectors, it seems they can be applied in the same way. In fact, I had thought that working with rows was more natural. Thank you very much for your tips. – ToBeSpecific Nov 26 '15 at 04:44
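A small illustration of the two points above (a sketch under assumed toy data, not from the original thread): pd.to_numeric coerces columns to a numeric dtype, and np.linalg.norm gives the Euclidean distance between two row vectors.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['1', '3', 'bad'], 'Col2': ['2', '4', '6']},
                  index=['Row1', 'Row2', 'Row3'])

# coerce every column to a numeric dtype; unparsable entries become NaN
df = df.apply(pd.to_numeric, errors='coerce')

# Euclidean distance between two row vectors
dist = np.linalg.norm(df.loc['Row1'] - df.loc['Row2'])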