I have a data file with 6 thousand lines and 1 column. Each of these 6 thousand lines holds 5 pieces of information, separated by spaces. I want to put each of these 5 pieces of information into its own column (in pandas), so the result would be 6 thousand rows and 5 columns.

I have something like this:

Name          B-V (mag)      VMAG           Plx(mas)    logRpHK

HD10697      0.72000000000   3.706399082220  30.71  -5.01849947767

HD10697      0.70500040054   3.714691682840  30.70  -5.04175893038

HIP8159      0.72000000000   3.706399082220  30.71  -5.05434051352

HD83951      0.36000013351   2.507321229600  18.77  -4.45974790621

Pandas reads it as just one column, and I would like it to be 5 columns: the Name column, the B-V (mag) column, the VMAG column, the Plx(mas) column and the logRpHK column. So the first value in each line would go in the first column, the second in the second column, and so on. I hope you can understand what I mean. Sorry for any mistakes, English is not my first language. Thanks.

Daniel Walker
  • I think you mean columns instead of rows – Camilo Martinez M. May 16 '21 at 01:32
  • Maybe you want to [change the delimiter](https://stackoverflow.com/a/33524402/9997212) `pandas` is using to read the file? Since you said there are spaces delimiting each column, you can do `sep=' '` when reading it. – enzo May 16 '21 at 01:43
  • Yes, Camilo. I meant columns instead of rows. I messed it up. Just edited. – Samara Monteiro May 16 '21 at 01:56
  • I tried something like `pd.read_csv('data', sep='\t')`. So now I have 5 columns, but they are named "Unnamed" and every value in each column is NaN. I don't know why. – Samara Monteiro May 16 '21 at 02:00
  • `import io;data = '''... ''';df = pd.read_csv(io.StringIO(data), delim_whitespace=True)` You can use this to split each line on whitespace. However, if there are spaces in the column names, the number of columns will not match, so you need to enclose any column name that contains spaces in double quotation marks (the default `quotechar`). – r-beginners May 16 '21 at 02:19

1 Answer

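The header (1st line) does not line up with the data: the column name "B-V (mag)" contains a space, so splitting the header on any whitespace would give six names for five columns. The header is therefore read separately and split on runs of two or more whitespace characters, while the data lines are split on any whitespace.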

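import re

import pandas as pd

with open("data.csv") as data:
    # Split the header on runs of 2+ whitespace so "B-V (mag)" stays one name
    headers = re.split(r"\s\s+", data.readline().strip())
    # Read the remaining lines, splitting on any run of whitespace
    df = pd.read_table(data, sep=r"\s+", header=None, names=headers)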
>>> df
      Name  B-V (mag)      VMAG  Plx(mas)   logRpHK
0  HD10697      0.720  3.706399     30.71 -5.018499
1  HD10697      0.705  3.714692     30.70 -5.041759
2  HIP8159      0.720  3.706399     30.71 -5.054341
3  HD83951      0.360  2.507321     18.77 -4.459748
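
To try the same idea without a file on disk, here is a minimal, self-contained sketch. It assumes the sample from the question pasted into a string, with `io.StringIO` standing in for the real file; the names `raw`, `buffer` and `naive_names` are only illustrative. It also shows why the header cannot simply be split on any whitespace.

```python
import io
import re

import pandas as pd

# Sample copied from the question; io.StringIO stands in for the real file.
raw = """\
Name          B-V (mag)      VMAG           Plx(mas)    logRpHK
HD10697      0.72000000000   3.706399082220  30.71  -5.01849947767
HD10697      0.70500040054   3.714691682840  30.70  -5.04175893038
HIP8159      0.72000000000   3.706399082220  30.71  -5.05434051352
HD83951      0.36000013351   2.507321229600  18.77  -4.45974790621
"""

# Splitting the header on any whitespace breaks "B-V (mag)" in two,
# which gives 6 names for 5 data columns.
naive_names = raw.splitlines()[0].split()
print(len(naive_names))  # 6

buffer = io.StringIO(raw)

# Header: split on runs of 2+ whitespace so "B-V (mag)" stays one name.
headers = re.split(r"\s{2,}", buffer.readline().strip())

# Data: the buffer is now positioned after the header line,
# so only the data rows are parsed, split on any run of whitespace.
df = pd.read_csv(buffer, sep=r"\s+", header=None, names=headers)

print(df.shape)  # (4, 5)
print(list(df.columns))
```

This relies on the header names being separated by at least two spaces, as they are in the sample. If the real file separates some names with a single space, the header would need to be written out by hand and passed through `names=` instead.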
Corralien