I have a CSV file where each row contains an array. I would like to convert the row contents to columns, i.e. end up with a matrix (since I have multiple rows). I can do it using a for loop and csv.reader, but that's quite slow. So I had the idea that Pandas would be faster, and that I could do the conversion without a loop. I read the file and get a DataFrame of size (200, 1), where each row contains 700 comma-separated floats, e.g. [0.4, 0.5, 0.3, ...]

If I call `.values` on the output I just get it converted to an object-dtype array - still not usable...

I just can't figure out how to convert this data into a Matrix...

Am I looking in the wrong direction here?

ranges = pd.read_csv(name, usecols=['ranges'])

What does work is this:

import csv
import ast

X = open(name)
csv_X = csv.reader(X)
ranges = []
next(csv_X)  # skip the header row of the csv
for row in csv_X:
    ranges.append(ast.literal_eval(row[14]))
X.close()

But that is just really slow. So, my idea about using Pandas is to speed this up.

mognic
  • `ranges = ranges.values` – Nihal Sangeeth Feb 12 '19 at 07:41
  • Possible duplicate of [Convert pandas dataframe to NumPy array](https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array) – Nihal Sangeeth Feb 12 '19 at 07:43
  • @NihalSangeeth As written in the post, `.values` does not work... The two posts you refer to do not have the exact same problem, and thus the solution does not apply. I have a DataFrame where a single column contains a float array in each row. I need to convert these float arrays to columns in a matrix, i.e. since there are 700 values in the array and 200 rows, I would have a matrix of size (200, 700). – mognic Feb 12 '19 at 08:38

1 Answer


With dataset looking like this:

                            range
0  [5, 5, 7, 5, 7, 2, 0, 4, 1, 6]
1  [1, 0, 6, 1, 1, 5, 7, 8, 6, 7]
2  [2, 0, 4, 6, 6, 6, 5, 1, 6, 5]
3  [5, 5, 2, 7, 1, 8, 7, 2, 8, 4]
4  [1, 5, 6, 6, 8, 2, 6, 6, 3, 1]

You can try:

pd.DataFrame(np.vstack(df.range.values))

which yields:

   0  1  2  3  4  5  6  7  8  9
0  5  5  7  5  7  2  0  4  1  6
1  1  0  6  1  1  5  7  8  6  7
2  2  0  4  6  6  6  5  1  6  5
3  5  5  2  7  1  8  7  2  8  4
4  1  5  6  6  8  2  6  6  3  1
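For completeness, here is a self-contained run of that step with toy data in place of the real CSV (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: one column whose cells are Python lists (made-up values).
df = pd.DataFrame({'range': [[5, 5, 7], [1, 0, 6], [2, 0, 4]]})

# Stack the per-row lists into a matrix, one list per row.
out = pd.DataFrame(np.vstack(df.range.values))
print(out.shape)  # (3, 3)
```

This only works when the cells are actual lists; if they come out of `read_csv` as strings, they need to be parsed first.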

Edited

If your rows are strings such as:

                ranges
0  8,9,7,6,3,2,4,1,8,3
1  7,9,9,2,1,6,4,1,8,2
2  9,3,0,9,7,7,0,9,9,6
3  0,7,1,0,5,5,1,2,4,2
4  3,3,8,0,8,7,3,6,6,2
5  9,3,7,6,5,7,8,3,8,7
6  1,6,7,8,5,6,7,0,7,8
7  5,5,0,9,2,1,5,4,3,4
8  3,8,9,8,6,3,8,5,9,8
9  8,5,1,7,1,4,8,1,6,4

Try:

pd.DataFrame(df.ranges.str.split(',').tolist())

which yields:

   0  1  2  3  4  5  6  7  8  9
0  8  9  7  6  3  2  4  1  8  3
1  7  9  9  2  1  6  4  1  8  2
2  9  3  0  9  7  7  0  9  9  6
3  0  7  1  0  5  5  1  2  4  2
4  3  3  8  0  8  7  3  6  6  2
5  9  3  7  6  5  7  8  3  8  7
6  1  6  7  8  5  6  7  0  7  8
7  5  5  0  9  2  1  5  4  3  4
8  3  8  9  8  6  3  8  5  9  8
9  8  5  1  7  1  4  8  1  6  4
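One caveat (my addition, not part of the original answer): `str.split` leaves every cell as a string, and if the cells carry brackets like `[0.4, 0.5, ...]` the split keeps those too. A hedged sketch that instead parses bracketed string cells with `ast.literal_eval` and stacks them into a float matrix ready for further calculations (the sample CSV text here stands in for the real file):

```python
import ast
from io import StringIO
import numpy as np
import pandas as pd

# Stand-in CSV: each 'ranges' cell is a bracketed, comma-separated list as text.
csv_text = 'ranges\n"[0.4, 0.5, 0.3]"\n"[0.1, 0.2, 0.9]"\n'
df = pd.read_csv(StringIO(csv_text))

# Parse each string into a real Python list, then stack into a 2-D float array.
matrix = np.vstack(df['ranges'].map(ast.literal_eval))
print(matrix.shape, matrix.dtype)  # (2, 3) float64
```

The resulting `matrix` is a plain NumPy array, so it can be used directly in downstream numeric code without further conversion.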
Chris
  • If I do that I end up with 200 rows and 1 column still - the format is still a DataFrame... Only thing that changed is that my column name is now `0` rather than `ranges` – mognic Feb 12 '19 at 08:27
  • Maybe I'm just not getting it, but I'm left with a DataFrame that does have the columns expected, but I am unable to convert it to anything useful for further calculations. I have added the "slow" code I am trying to replace in the original post... – mognic Feb 12 '19 at 13:10