I am confronted with a rather specific problem when importing data with pandas. Another program produces an output text file like:
* FRUIT KEYWORD L WEIGHT
"Apple" "red" 0 1.0000003669009609
"Banana" "Yellow" 0 1.0000003669009609
which I would like to read into a pandas data frame using rad_table()
. Note that the real file doesn't start with the headers, there are lots of rows before the actual header.
Columns are separated with several white spaces, so using sep = r'\s+'
is generally fine. The problem comes with the first column where *
is followed by the actual column name FRUIT
- both separated by only one whitespace.
Using pd.read_table('example.txt', sep = r'\s+')
produces a data frame with the asterisk as first column header followed by FRUIT as next etc, ending up with the WEIGHT
column filled with Nan
. In fact, the first header should be FRUIT
and the *
ignored to prevent this shift of headers.
How can I tell pandas to ignore the single whitespace as separators and considering only multiple whitespaces? Or do I have to use another approach (maybe regex) to solve this?
I would like to prevent reading the whole frame first, extracting the columns, removing the *
and using this 'cleaned' list of headers to read in the data again with setting the correct headers. An 'on the fly' apporaoch would be nice.
Maybe someone could help me with this - I would be very happy!