1

I am using Python for a machine learning problem. My data is a CSV file in which each line has the format: <class-label>, feature_1, feature_2, ...

An example would be:

1,0,0,3,4,5
3,0,0,9,0,0
5,0,0,2,2,2
1,0,1,5,0,0
5,0,1,3,0,0
5,1,0,0,4,0

I need to split the data based on the first column. For the example above, I should end up with a dictionary of 3 entries, each mapping a class label to a matrix of features. Of course, I can iterate through the lines, but I am looking for more of a one-liner to do this.

EDIT: So the answer should look something like this:

1 => [ [0,0,3,4,5],
       [0,1,5,0,0]]
3 => [ [0,0,9,0,0]]
5 => [ [0,0,2,2,2],
       [0,1,3,0,0],
       [1,0,0,4,0]]
Aman Deep Gautam
  • In your example, there are 6 columns of csv chunks. What do you mean by 'split the data based on the first column'? The first of the 6 csv chunks or the first item of each csv chunk? Also, what if there are duplicates of your dictionary key? – vincent Oct 02 '15 at 17:36
  • @vincent edited the question. There will be duplicates in the first columns as they are class labels, so they should add into the matrix. – Aman Deep Gautam Oct 02 '15 at 17:43
  • Would you be okay with list of such matrices? – Divakar Oct 02 '15 at 17:53
  • @Divakar I do not want to lose the labels, so I would have to do some other manipulations. `dict` would have been ideal, but I guess I could work with a list as well. – Aman Deep Gautam Oct 02 '15 at 17:56
  • Did my one liner work for you? – vincent Oct 02 '15 at 19:38

4 Answers

1

With NumPy tools:

import numpy as np
tab = np.loadtxt('data.txt', delimiter=',', dtype=int)
labels, data = tab[:, 0], tab[:, 1:]
# one boolean mask per unique label selects that label's rows
dic = {label: data[labels == label] for label in np.unique(labels)}

gives:

{1: array([[0, 0, 3, 4, 5],
    [0, 1, 5, 0, 0]]),
3: array([[0, 0, 9, 0, 0]]),
5: array([[0, 0, 2, 2, 2],
    [0, 1, 3, 0, 0],
    [1, 0, 0, 4, 0]])}
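
For anyone who wants to try this without creating a file, here is a minimal sketch that feeds the question's sample rows through `io.StringIO` (an assumption made for illustration; the answer itself reads from 'data.txt'). It also casts the keys to plain Python ints, which is optional.

```python
import io
import numpy as np

# Stand-in for 'data.txt': np.loadtxt also accepts file-like objects.
sample = io.StringIO(
    "1,0,0,3,4,5\n"
    "3,0,0,9,0,0\n"
    "5,0,0,2,2,2\n"
    "1,0,1,5,0,0\n"
    "5,0,1,3,0,0\n"
    "5,1,0,0,4,0\n"
)

tab = np.loadtxt(sample, delimiter=",", dtype=int)
labels, data = tab[:, 0], tab[:, 1:]

# np.unique yields NumPy integer scalars; int() turns the keys into plain ints.
dic = {int(label): data[labels == label] for label in np.unique(labels)}

for label, rows in dic.items():
    print(label, rows.tolist())
```

Each value is a 2D array, so a class with a single row still comes back as a matrix with one row.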
B. M.
0
a = {}
with open('infile.csv') as f:
    for line in f:
        L = line.strip().split(',')
        # L[0] is the class label; L[1:] holds the features
        if L[0] in a:
            a[L[0]].append(L[1:])
        else:
            a[L[0]] = [L[1:]]

This example uses list slicing, which returns part of a list as a new list.

At the end, `a` holds:

{
 '1': [
       ['0', '0', '3', '4', '5'],
       ['0', '1', '5', '0', '0']
      ],
 '3': [
       ['0', '0', '9', '0', '0']
      ],
 '5': [
       ['0', '0', '2', '2', '2'],
       ['0', '1', '3', '0', '0'],
       ['1', '0', '0', '4', '0']
      ]
}
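
Note that the labels and features here are strings, while the output requested in the question uses numbers. A minimal sketch of one way to convert them, assuming every field is a valid integer (the name `features_by_label` is made up for this example):

```python
# Same shape as the dictionary `a` built above, hard-coded so the sketch runs on its own.
a = {
    '1': [['0', '0', '3', '4', '5'], ['0', '1', '5', '0', '0']],
    '3': [['0', '0', '9', '0', '0']],
    '5': [['0', '0', '2', '2', '2'], ['0', '1', '3', '0', '0'], ['1', '0', '0', '4', '0']],
}

# Convert each label and each feature value from str to int.
features_by_label = {
    int(label): [[int(value) for value in row] for row in rows]
    for label, rows in a.items()
}

print(features_by_label[5])
```

If the features can be non-integer, `float` could be used in place of `int` for the values.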
Ajay
0

How about this?

from collections import defaultdict

dd = defaultdict(list)

lines = [
    '1,0,0,3,4,5',
    '3,0,0,3,4,5',
    '5,0,0,3,4,5',
    '1,0,0,3,4,5',
    '5,0,0,3,4,5',
    '5,0,0,3,4,5'
]

# One-liner: the list comprehension is used only for its side effect of filling dd
[ dd[line.split(',')[0]].append(line.split(',')[1:]) for line in lines ]

print(dd)

Then dd =

defaultdict(<type 'list'>, 
           {'1': [
                    ['0', '0', '3', '4', '5'],
                    ['0', '0', '3', '4', '5']
                 ],
            '3': [
                    ['0', '0', '3', '4', '5']
                 ],
            '5': [
                    ['0', '0', '3', '4', '5'],
                    ['0', '0', '3', '4', '5'],
                    ['0', '0', '3', '4', '5']
                 ]
           }
)
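
If a one-liner is not a hard requirement, the same `defaultdict` idea can be written as a plain loop that splits each line only once. This is a sketch for Python 3 (it relies on starred unpacking) and uses the sample rows from the question rather than the rows above:

```python
from collections import defaultdict

# Sample rows taken from the question.
lines = [
    '1,0,0,3,4,5',
    '3,0,0,9,0,0',
    '5,0,0,2,2,2',
    '1,0,1,5,0,0',
    '5,0,1,3,0,0',
    '5,1,0,0,4,0',
]

dd = defaultdict(list)
for line in lines:
    # Split once: the first field is the label, the rest are the features.
    label, *features = line.split(',')
    dd[label].append(features)

print(dict(dd))
```

The keys and values are still strings here, exactly as in the one-liner.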
vincent
0

Assuming A has the data stored as a 2D numpy array, you can do something like this -

unqA = np.unique(A[:,0])
out = {unqA[i]:A[A[:,0]==unqA[i],1:] for i in range(len(unqA))}

Sample run -

In [109]: A
Out[109]: 
array([[1, 0, 0, 3, 4, 5],
       [3, 0, 0, 9, 0, 0],
       [5, 0, 0, 2, 2, 2],
       [1, 0, 1, 5, 0, 0],
       [5, 0, 1, 3, 0, 0],
       [5, 1, 0, 0, 4, 0]])

In [110]: unqA = np.unique(A[:,0])

In [111]: {unqA[i]:A[A[:,0]==unqA[i],1:] for i in range(len(unqA))}
Out[111]: 
{1: array([[0, 0, 3, 4, 5],
        [0, 1, 5, 0, 0]]),
 3: array([[0, 0, 9, 0, 0]]),
 5: array([[0, 0, 2, 2, 2],
        [0, 1, 3, 0, 0],
        [1, 0, 0, 4, 0]])}

If you are okay with a list of such matrices as the output, you could avoid looping like so -

sortedA = A[A[:,0].argsort()]
_,idx = np.unique(sortedA[:,0],return_index=True)
out = np.split(sortedA[:,1:],idx[1:],axis=0)

Sample run -

In [143]: A
Out[143]: 
array([[1, 0, 0, 3, 4, 5],
       [3, 0, 0, 9, 0, 0],
       [5, 0, 0, 2, 2, 2],
       [1, 0, 1, 5, 0, 0],
       [5, 0, 1, 3, 0, 0],
       [5, 1, 0, 0, 4, 0]])

In [144]: sortedA = A[A[:,0].argsort()]

In [145]: _,idx = np.unique(sortedA[:,0],return_index=True)

In [146]: np.split(sortedA[:,1:],idx[1:],axis=0)
Out[146]: 
[array([[0, 0, 3, 4, 5],
        [0, 1, 5, 0, 0]]), array([[0, 0, 9, 0, 0]]), array([[0, 0, 2, 2, 2],
        [0, 1, 3, 0, 0],
        [1, 0, 0, 4, 0]])]

Now, if you still want to have a dict-based output, you could use the output from above, like so -

out_dict = {sortedA[:,0][idx[i]]:out[i] for i in range(len(idx))}

giving us -

In [153]: out
Out[153]: 
[array([[0, 0, 3, 4, 5],
        [0, 1, 5, 0, 0]]), array([[0, 0, 9, 0, 0]]), array([[0, 0, 2, 2, 2],
        [0, 1, 3, 0, 0],
        [1, 0, 0, 4, 0]])]

In [154]: {sortedA[:,0][idx[i]]:out[i] for i in range(len(idx))}
Out[154]: 
{1: array([[0, 0, 3, 4, 5],
        [0, 1, 5, 0, 0]]),
 3: array([[0, 0, 9, 0, 0]]),
 5: array([[0, 0, 2, 2, 2],
        [0, 1, 3, 0, 0],
        [1, 0, 0, 4, 0]])}
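
One caveat with the sort-based approach: the default sort used by `argsort` is not guaranteed to be stable, so rows that share a label may not keep their original relative order on every input. If that order matters, a possible variant is to request a stable sort. The sketch below assumes a NumPy version that supports `kind='stable'`; the names `order`, `groups`, and `labels` are introduced here for illustration:

```python
import numpy as np

A = np.array([[1, 0, 0, 3, 4, 5],
              [3, 0, 0, 9, 0, 0],
              [5, 0, 0, 2, 2, 2],
              [1, 0, 1, 5, 0, 0],
              [5, 0, 1, 3, 0, 0],
              [5, 1, 0, 0, 4, 0]])

# kind='stable' keeps rows with equal labels in their original order.
order = A[:, 0].argsort(kind='stable')
sortedA = A[order]

# idx holds the position where each label's block of rows starts in sortedA.
labels, idx = np.unique(sortedA[:, 0], return_index=True)
groups = np.split(sortedA[:, 1:], idx[1:], axis=0)

# Pair each unique label with its block of feature rows.
out_dict = {int(label): group for label, group in zip(labels, groups)}

for label, group in out_dict.items():
    print(label, group.tolist())
```

Using `zip(labels, groups)` also avoids indexing back into `sortedA` to recover the keys.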
Divakar