0

I am confronted with a rather specific problem when importing data with pandas. Another program produces an output text file like:

* FRUIT         KEYWORD     L      WEIGHT                    
 "Apple"       "red"       0      1.0000003669009609   
 "Banana"      "Yellow"    0      1.0000003669009609   

which I would like to read into a pandas data frame using rad_table(). Note that the real file doesn't start with the headers, there are lots of rows before the actual header.

Columns are separated with several white spaces, so using sep = r'\s+' is generally fine. The problem comes with the first column where * is followed by the actual column name FRUIT- both separated by only one whitespace.

Using pd.read_table('example.txt', sep = r'\s+') produces a data frame with the asterisk as first column header followed by FRUIT as next etc, ending up with the WEIGHT column filled with Nan. In fact, the first header should be FRUIT and the * ignored to prevent this shift of headers.

How can I tell pandas to ignore the single whitespace as separators and considering only multiple whitespaces? Or do I have to use another approach (maybe regex) to solve this?

I would like to prevent reading the whole frame first, extracting the columns, removing the * and using this 'cleaned' list of headers to read in the data again with setting the correct headers. An 'on the fly' apporaoch would be nice.

Maybe someone could help me with this - I would be very happy!

Number42
  • 133
  • 2
  • 10
  • Try using `\s{2,}` to require at least 2 whitespaces, or `(?<!^\*)\s+` to match 1 or more whitespace chars not preceded with `*` at the start of the string – Wiktor Stribiżew Dec 08 '17 at 12:42
  • Just so you know, using a regexp other than `\s+` will fall back to the python engine, which may have a huge impact on performance if you are reading huge table. – Holt Dec 08 '17 at 12:47

0 Answers0