
The Scenario

I've read a CSV file (which is tab-separated) into a DataFrame, and I now need it as a numpy array for clustering, without changing the data types.

The Problem

The references I have tried so far (listed below) did not give me the required output. The two columns I'm trying to fetch are int64 / float64, as shown below:

         uid   iid       rat
0        196   242  3.000000
1        186   302  3.000000
2         22   377  1.000000

I'm interested in only iid and rat for the moment. I want to pass them to the KMeans.fit() method, and without the EPSILON (e+) notation in the values. I need the data in the following format:

Expected format

[[242, 3.000000],
[302, 3.000000],
[22, 1.000000]]

Unsuccessful Attempt

X = values[:, 1:2]
Y = values[:, 2:3]
someArray = np.array([X,Y])
print someArray

This doesn't fare well on execution:

[[[  2.42000000e+02]
  [  3.02000000e+02]
  [  3.77000000e+02]
  ..., 
  [  1.35200000e+03]
  [  1.62600000e+03]
  [  1.65900000e+03]]
 [[  3.00000000e+00]
  [  3.00000000e+00]
  [  1.00000000e+00]
  ..., 
  [  1.00000000e+00]
  [  1.00000000e+00]
  [  1.00000000e+00]]]

References that haven't helped so far

  1. This one
  2. This two
  3. This three
  4. This four

EDIT 1

I tried np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True) and got this:

[[             nan   1.96000000e+02   1.86000000e+02 ...,   4.79000000e+02
    4.79000000e+02   4.79000000e+02]
 [             nan   2.42000000e+02   3.02000000e+02 ...,   1.36000000e+03
    1.39400000e+03   1.65200000e+03]
 [             nan   3.00000000e+00   3.00000000e+00 ...,   2.00000000e+00
    1.92803605e+00   1.00000000e+00]]
T3J45

3 Answers


Use label-based selection and the .values attribute of the resulting pandas objects, which will be some sort of numpy array:

>>> df
   uid  iid  rat
0  196  242  3.0
1  186  302  3.0
2   22  377  1.0
>>> df.loc[:,['iid','rat']]
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df.loc[:,['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])

Note, your integer column will get promoted to float.
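
The promotion can be checked directly. The following is a minimal sketch that rebuilds the three sample rows by hand (the DataFrame construction is an assumption for illustration, not the asker's actual file) and inspects the dtypes before and after selecting both columns:

```python
import pandas as pd

# Hand-built copy of the three sample rows shown in the question.
df = pd.DataFrame({
    "uid": [196, 186, 22],
    "iid": [242, 302, 377],
    "rat": [3.0, 3.0, 1.0],
})

# Each column has its own dtype inside the DataFrame.
print(df.dtypes)

# Selecting an integer column together with a float column yields one
# homogeneous array, so the integers are promoted to float.
X = df.loc[:, ["iid", "rat"]].values
print(X.dtype, X.shape)

# Selecting the integer column on its own keeps an integer dtype.
iid_only = df["iid"].values
print(iid_only.dtype)
```

A 2-D array of floats with shape (n_samples, 2) is the kind of input KMeans.fit() works with, so the promotion should not be a problem for clustering.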

Also note, this particular selection could be approached in different ways:

>>> df.iloc[:, 1:] # integer-position based
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df[['iid','rat']] # plain indexing performs column-based selection
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0

I like label-based because it is more explicit.

Edit

The reason you aren't seeing commas is an artifact of how numpy arrays are printed:

>>> df[['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(df[['iid','rat']].values)
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]

And actually, it is the difference between the str and repr results of the numpy array:

>>> print(repr(df[['iid','rat']].values))
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(str(df[['iid','rat']].values))
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]
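
The same reasoning applies to the e+ notation: it belongs to the printed text, not to the stored values. As a hedged sketch, assuming a reasonably recent numpy that provides the np.printoptions context manager, the display can be switched to fixed-point notation without touching the data (the sample values below are made up to resemble the question's output):

```python
import numpy as np

# Sample values loosely based on the question; the wide spread between
# the largest and smallest value is what tends to trigger e+ display.
arr = np.array([[242.0, 3.0],
                [302.0, 3.0],
                [1659.0, 1.0]])

# Text produced with the default print settings.
default_text = str(arr)

# suppress=True asks numpy for fixed-point display; only the text changes.
with np.printoptions(suppress=True):
    fixed_text = str(arr)

print(default_text)
print(fixed_text)

# The stored values are the same whichever way they are displayed.
print(arr[2, 0] == 1659.0)
```

If the setting should stay in effect for the whole session, np.set_printoptions(suppress=True) does the same thing globally.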
juanpa.arrivillaga
  • Tried `print df.loc[:, ['iid','rat']].values` and got `[[ 2.42000000e+02 3.00000000e+00] [ 3.02000000e+02 3.00000000e+00] [ 3.77000000e+02 1.00000000e+00] ..., [ 1.36000000e+03 2.00000000e+00] [ 1.39400000e+03 1.92803605e+00] [ 1.65200000e+03 1.00000000e+00]]` – T3J45 Aug 10 '17 at 18:34
  • @Tejas that looks correct to me... what is the issue? – juanpa.arrivillaga Aug 10 '17 at 18:35
  • @Tejas but in general, don't dump that in a comment, it is unreadable. Edit **your actual question** – juanpa.arrivillaga Aug 10 '17 at 18:37
  • Brother, if you look, that does not put commas between the two values inside the lists. – T3J45 Aug 10 '17 at 18:51
  • Looks good with the added solution. That solves 50% of my problem; I need a way out of those e+ or so-called EPSILONS – T3J45 Aug 10 '17 at 18:55
  • @Tejas no, you don't. That's **just the way it's being printed**. – juanpa.arrivillaga Aug 10 '17 at 19:00
  • what is the issue with epsilons? When you print the array that is just how the data is displayed, it is still a float. – BenT Aug 10 '17 at 19:00
  • Well, that'd be my obsession with clean data. Anyhow, I'll proceed with clustering with the epsilons. I hope that goes well. Grateful to you, team! – T3J45 Aug 10 '17 at 19:02
  • @Tejas That is not an epsilon, that is an "e" which stands for "exponent". Anyway the "epsilons" aren't **actually there**. Again, it's an issue of how your data is being printed to the screen vs what is actually in the array. There `numpy` has pretty convoluted rules as to what, exactly, is the format it chooses to represent a float. Probably there is one or two very small or very large floats that require *scientific notation* to fit neatly into an array representation, so the rest are formatted using scientific notation as well. – juanpa.arrivillaga Aug 10 '17 at 19:18
  • Ok, I'm learning. So far I've also learned that Python is much more robust and flexible than other languages, so I gave it a try. Anyhow, I'm still struggling to get the data in the right format for clustering. Since repr adds array([]) around the output, I'm unable to pass it on to clustering. – T3J45 Aug 10 '17 at 19:28
  • @Tejas **No**. `repr` and `str` **return strings**. You want to pass the *array*. All you need is `df.loc[:,['iid','rat']]`. You need to grok the difference between a string representation of an object and the object itself. – juanpa.arrivillaga Aug 10 '17 at 19:29
  • Ok, I'll probably take this on chat room tomorrow 1000 IST. I'll explain the mess then. – T3J45 Aug 10 '17 at 19:31
  • @juanpa.arrivillaga Hey, I'll clear the confusion here. I'm using sklearn.cluster for KMeans clustering. The data needs to be in spatial format i.e. (x,y) to be plotted. For reference I'm using this [link](https://pythonprogramming.net/flat-clustering-machine-learning-python-scikit-learn/), you'll find the format of data that is passed to KMeans.fit(**input**) – T3J45 Aug 11 '17 at 05:58

Why don't you just import the 'csv' as a numpy array?

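import numpy as np

def read_file(fname):
    # Tab-delimited file; lines starting with "%" are treated as comments.
    # unpack=True transposes the result, so each column comes back as a row.
    return np.genfromtxt(fname, delimiter="\t", comments="%", unpack=True)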
BenT
  • tried `np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True)` and got this `[[ nan 1.96000000e+02 1.86000000e+02 ..., 4.79000000e+02 4.79000000e+02 4.79000000e+02] [ nan 2.42000000e+02 3.02000000e+02 ..., 1.36000000e+03 1.39400000e+03 1.65200000e+03] [ nan 3.00000000e+00 3.00000000e+00 ..., 2.00000000e+00 1.92803605e+00 1.00000000e+00]]` Ended up with a NaN, which I do not expect. – T3J45 Aug 10 '17 at 18:41
  • I can't actually see your data, so I do not know where that is coming from unless you have line numbers in your saved csv file that are strings. You could use indexing to remove the NaNs since the rest of the data looks correct. – BenT Aug 10 '17 at 18:51
  • It looks like for some reason the data is sideways; you should transpose it and slice away the first row – DJK Aug 11 '17 at 01:02
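
The NaN values and the sideways layout discussed in these comments can be reproduced on a small in-memory sample. This is a sketch under the assumption that the file has a single header row (uid, iid, rat), which would explain the NaN when every field is parsed as a float; the sample text is made up from the rows in the question:

```python
from io import StringIO

import numpy as np

# In-memory stand-in for the tab-separated file, including a header row.
text = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# Same call as in the question: the header row cannot be parsed as float,
# so it becomes NaN, and unpack=True transposes the result.
raw = np.genfromtxt(StringIO(text), delimiter="\t", unpack=True)
print(raw.shape)
print(raw)

# Skipping the header, keeping only the iid and rat columns, and leaving
# unpack off gives one row per sample.
X = np.genfromtxt(StringIO(text), delimiter="\t", skip_header=1, usecols=(1, 2))
print(X.shape)
print(X)
```

With a real file, the StringIO object would be replaced by the file name.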

It seems you need read_csv to create the DataFrame first, filtering to only the second and third columns, and then convert it to a numpy array with values:

import pandas as pd
from sklearn.cluster import KMeans
from io import StringIO  # pandas.compat.StringIO is not available in recent pandas versions

temp=u"""col,iid,rat
4,1,0
5,2,4
6,3,3
7,4,1"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), usecols = [1,2])
print (df)
   iid  rat
0    1    0
1    2    4
2    3    3
3    4    1

X = df.values 
print (X)
[[1 0]
 [2 4]
 [3 3]
 [4 1]]

kmeans = KMeans(n_clusters=2)
a = kmeans.fit(X)
print (a)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
jezrael
  • Just for the record, the example I followed had commas between the elements of each list and between the lists. Does that mean Python is relaxed about such conventions? By the way, that helped. – T3J45 Aug 12 '17 at 11:20
  • Comma is the default separator in read_csv; if you want to change it, use `sep='\t'` for tabs or `sep='\s+'` for one or more whitespace characters. – jezrael Aug 12 '17 at 11:21