2

I've encountered the following problem. I have data that's formatted nicely for the human eye but is terrible for a computer to read. E.g., it has a table format but without separators and its headers are word-wrapped, meaning that if it's longer than a specific length, the rest is written on a new line.

column 1        column 2 but          column 3
                with more text
1.5893001       1.513451              1.198420
1.5893001       1.513451              1.198420
1.5893001       1.513451              1.198420
1.5893001       1.513451              1.198420
etc...

I want the columns in a DataFrame to have the text that this header has. I'm already annoyed by the fact that I have to preprocess the data by manually converting it to a dict.

After loading this data into a DataFrame, the DataFrame's first row is the column labels where column2 is ofcourse only partially read. In the second row, the first value (of column 1) has the value of '', the second has 'with more text' and the third has NaN.

I've tried df.columns = df.iloc[0] + ' ' + df.iloc[1] which results in the second column having the correct label, but the others having nan.

Is there a way to make a conditional expression?

[edit] @jezrael for the correct formatting

{0: {0: 'column 1', 1: '', 2: 1.5893001, 3: 1.5893001}
 1: {0: 'column 2', 1: 'with more text', 2: 1.513451, 3: 1.513451}
 2: {0: 'column 3', 1: None, 2: 1.198420, 3: 1.198420}}
Alb
  • 1,063
  • 9
  • 33
  • What is `print (df.head().to_dict())` ? – jezrael Mar 15 '19 at 09:39
  • Also maybe better is upload some data sample file to dropbox, gdocs, because reading header is data dependent... Hard to verify with copy from text – jezrael Mar 15 '19 at 09:43
  • 1
    @jezrael I've edited my post and added df.head().to_dict() – Alb Mar 15 '19 at 09:47
  • If add parameter `header=[0,1]` to `read_csv` it working or error? – jezrael Mar 15 '19 at 09:49
  • 1
    @jezrael I'd have to write this df to a tmp csv file and reread it into a DataFrame. I'll report my findings – Alb Mar 15 '19 at 09:50
  • Perhaps just use StringIO instead of creating a tmp csv file? https://stackoverflow.com/questions/22604564/create-pandas-dataframe-from-a-string – Porz Mar 15 '19 at 09:53

1 Answers1

1

You can replace misisng values by Series.fillna adn remove traling spaces by str.strip, then remove first 2 rows with iloc:

df.columns = (df.iloc[0] + ' ' + df.iloc[1].fillna('')).str.strip()
df = df.iloc[2:].reset_index(drop=True)
print (df)
  column 1 column 2 with more text column 3
0   1.5893                 1.51345  1.19842
1   1.5893                 1.51345  1.19842
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252