
Yes, you won't believe it, but I've been browsing for two hours for a simple line of code. How do you convert a DataFrame full of strings to a float DataFrame? Even more, how do you convert a DataFrame full of strings to an np.array? There seem to be two solutions suggested over and over, convert_objects and astype, and neither of them works.

database = pd.read_csv('test1.csv',header=None)
database


Out[165]:
0
0   0,0,1,0,0
1   1,0,1,0,0
2   0,4,0,1,0
3   1,4,0,1,0
4   1,1,0,0,1
5   2,1,0,0,1

database = database.astype(str).convert_objects(convert_numeric=True)
x = np.array(database)

In [170]:
x
Out[170]:
array([['0,0,1,0,0'],
       ['1,0,1,0,0'],
       ['0,4,0,1,0'],
       ['1,4,0,1,0'],
       ['1,1,0,0,1'],
       ['2,1,0,0,1']], dtype=object)

OR

DATA = database.astype(float)
ValueError: could not convert string to float: '2,1,0,0,1'
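
Both failures above make sense once you notice the frame has a single column whose values are whole comma-joined strings. One vectorized route (a sketch that rebuilds the single-column frame from the output shown above, since test1.csv itself isn't available) is to split that column on commas into separate columns and cast:

```python
import pandas as pd

# Reproduce the question's situation: every row came in as one
# comma-joined string in a single column (column 0).
database = pd.DataFrame(['0,0,1,0,0', '1,0,1,0,0', '0,4,0,1,0',
                         '1,4,0,1,0', '1,1,0,0,1', '2,1,0,0,1'])

# Vectorized fix: split each string on commas into columns, then cast.
# No Python-level loop over rows is written by hand here.
numeric = database[0].str.split(',', expand=True).astype(float)
x = numeric.to_numpy()
print(x.shape)  # (6, 5)
```

`str.split(..., expand=True)` returns a DataFrame with one column per field, so `astype(float)` then succeeds, which sidesteps the per-item iteration the asker was worried about.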
Mr.Robot
  • Iterating through each and every item in the list is not an option, as I am going to use this method for a huge database. – Mr.Robot Nov 23 '15 at 20:30
  • you have to convert the string into a list first with something like `str.split(',')` then you can convert each item to a float. – R Nar Nov 23 '15 at 20:31
  • Like I said, it's not really an option as I will have to do that for some 20000 rows... unless Python just doesn't provide a normal option for such a thing. – Mr.Robot Nov 23 '15 at 20:36
  • have you tried explicitly stating `sep = ','` in your `pd.read_csv` call? – R Nar Nov 23 '15 at 20:44
  • Is the whole row enclosed in quotes? If yes, you can use the Linux terminal to strip quotes from the ends of the rows quickly. If you are on Windows, or if only a part of your whole row is like this, you can use `pd.to_csv()` and write either the whole dataframe or just the problem column back to a csv file, and pass `quotechar=None` while you are doing it. Reading this csv file again should solve your problem. I am not able to think of any other better solution at the moment (without resorting to iteration). – Kartik Nov 23 '15 at 20:45
  • The one thing that is sure is that `np.array()` and `.astype(float)` would work, but that it is most likely something about the `.csv` itself. Perhaps you actually do have a header line? If you'd copy&paste `test1.csv` we would surely be able to help you out. As it stands, it looks like it reads every line on its own. – PascalVKooten Nov 23 '15 at 20:51
  • R Nar, that does nothing unfortunately... :| – Mr.Robot Nov 23 '15 at 20:52
  • Hm... PascalVKooten, how can I do that? Upload the file somewhere? – Mr.Robot Nov 23 '15 at 20:52
  • @Mr.Robot No newlines? Put this line also in your question. – PascalVKooten Nov 23 '15 at 20:53
  • @Mr.Robot I added an answer that could be updated when your test format is known. I am convinced it has to be with the format rather than having to convert. – PascalVKooten Nov 23 '15 at 20:59
  • Never mind, I just did everything manually via Excel: Text to Columns. After that pandas understood that it was numbers, and then it was easy to just do database = np.array(olddatabase). – Mr.Robot Nov 23 '15 at 21:00
  • If it really was so "big" data as you stated, excel would have had absolutely no chance. Excel chokes at 100k, python won't choke until much later... If you would just open up that csv in the first place you'd see what would be wrong. – PascalVKooten Nov 23 '15 at 21:01
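
PascalVKooten's suggestion to just open the csv and look is worth doing programmatically; a two-line peek (sketched against a stand-in file, since the real test1.csv from the question isn't available, and assuming each whole row is wrapped in quotes, which would explain the single-column read):

```python
# Stand-in for the question's test1.csv: each row wrapped in double
# quotes, the likely reason pandas reads everything as one column.
with open('test1.csv', 'w') as f:
    f.write('"0,0,1,0,0"\n"1,0,1,0,0"\n')

# Print the raw lines; repr() makes quote characters and \n vs \r\n
# line endings visible, which read_csv output hides.
with open('test1.csv') as f:
    for line in f:
        print(repr(line))
```

If the output shows lines like `'"0,0,1,0,0"\n'`, the commas are protected by the quotes, so `sep=','` alone cannot split them.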

1 Answer


From what I've seen in your comment, this might help:

import pandas
pandas.read_csv('test1.csv', header=None, lineterminator=' ')
PascalVKooten
  • after that adding lineterminator i get this result: `0 1 2 3 4 5 6 7 8 9 ... \ 0 0,0,1,0,0\r\n"1 0 1 0 0"\r\n"0 4 0 1 0"\r\n"1 4 ... 11 12 13 14 15 16 17 18 19 20 0 1 0"\r\n"1 1 0 0 1"\r\n"2 1 0 0 1"\r\n` – Mr.Robot Nov 23 '15 at 21:05
  • You have got yourself a really messed up csv. – PascalVKooten Nov 23 '15 at 21:40
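
Given the `\r\n` and embedded quote characters in that comment, the `lineterminator` route looks like a dead end. A read-time alternative (a sketch, assuming each whole row is wrapped in double quotes; the stand-in file below recreates that shape, since the real test1.csv isn't available) is to disable quote handling so the commas split into fields, then strip the stray quotes left on the edge columns:

```python
import csv
import pandas as pd

# Recreate a file shaped like the question's test1.csv: each whole row
# wrapped in double quotes.
with open('test1.csv', 'w') as f:
    for row in ['0,0,1,0,0', '1,0,1,0,0', '0,4,0,1,0',
                '1,4,0,1,0', '1,1,0,0,1', '2,1,0,0,1']:
        f.write(f'"{row}"\n')

# QUOTE_NONE makes read_csv treat quotes as ordinary characters, so the
# commas now act as separators. The first and last columns then carry a
# leftover quote character, which str.strip removes before the cast.
database = pd.read_csv('test1.csv', header=None, quoting=csv.QUOTE_NONE)
database[0] = database[0].str.strip('"')
database[4] = database[4].str.strip('"')
database = database.astype(float)
print(database.shape)  # (6, 5)
```

After this, `np.array(database)` (or `database.to_numpy()`) gives the float array the asker wanted without any manual Excel step.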