0

A text file looked like this. I want to covert it into a CSV file.

enter image description here

Water level.txt

With Pandas, when I used:

df = pd.read_fwf(f)

It looks like:

enter image description here

It seems there are tab and space used for delimiters, I changed the line to:

df = pd.read_csv('Water level.txt' ,  sep = '[" "|\t]', encoding='GBK', engine = 'python')

But it warns:

pandas.errors.ParserError: Expected 14 fields in line 4, saw 16. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

What's the right way with Python to convert it into a CSV file?

martineau
  • 119,623
  • 25
  • 170
  • 301
Mark K
  • 8,767
  • 14
  • 58
  • 118
  • 1
    Note that `[" "|\t]` includes matches for the pipe - you're inside a character class, it doesn't mean "or" there. You're also matching a *single* space, so you probably have a few fields containing an empty string. In terms of dealing with "ragged" data, see https://stackoverflow.com/q/46127026/3001761. – jonrsharpe Jan 04 '21 at 17:43
  • 1
    Are any of these quoted fields with spaces, e.g. `"North East"`, or worse, unquoted fields with spaces? The regex parser will treat `"North East"` as two columns `("North", "East")`. – tdelaney Jan 04 '21 at 17:52
  • @jonrsharpe, thank you for the comment. that's, to tell the read_csv the number of columns. :) – Mark K Jan 04 '21 at 17:55
  • @tdelaney, thank you for your comment. fortunately not, there are only 'East' and 'West'. – Mark K Jan 04 '21 at 17:56
  • If you want to separate with any number of tabs or spaces, that would be `sep = r'[ \t]+'` – tdelaney Jan 04 '21 at 17:58
  • @tdelaney, I tried sep = r'[ \t]+', but it still gives ParserError... – Mark K Jan 04 '21 at 18:02
  • 1
    Pandas needs to guess how many columns there are. In your example, the file starts with 4 columns and moves to 5. But by the time that 5th column comes along, pandas has already decided there are only 4. Since there is no header line defining all of the columns, you can add a `names=["foo", "bar", "baz", etc...]` (using 14, more descriptive names) to the `read_csv`. You could even do `names=range(14)` if you just want integer column names. – tdelaney Jan 04 '21 at 18:14
  • @tdelaney, thanks again. would you consider to put on your comments as an answer so we can close this question? – Mark K Jan 05 '21 at 01:20

2 Answers2

1

Try passing in the column widths if the data structure doesn't change. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html There's other options here as well with read_fwf.

Verify the widths argument is correct:

pd.read_fwf('JcP65rQY5F2Y.txt', widths=[5,10,9,2,5])


    Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3 Unnamed: 4
0        09:25        7.54         288          17        NaN
1        09:30        7.55          20           6        NaN
2        09:30        7.55           7           2       East
3        09:30        7.55          11           3       East
4        09:30        7.56           5           4       West
..         ...         ...         ...         ...        ...
194      09:59        7.60           3           1       East
195      09:59        7.60           9           4       East
196      09:59        7.60           8           1       West
197      09:59        7.60          51           3       West
198      09:59        7.59          20          15       East

[199 rows x 5 columns]
Jonathan Leon
  • 5,440
  • 2
  • 6
  • 14
1

Your regex needs a tweak, `r"[ \t]+" selects any length of spaces and tabs (1 or greater). Additionally, pandas uses the first line of the file to determine how many columns there are. Your example starts with 4 columns and then adds another later on. That's too late - pandas has already created 4 element rows. You can solve that by supplying your own column names, letting pandas know how many there really are. In this example I'm just using integers but you could give them more useful names.

df = pd.read_csv('Water level.txt' ,  sep=r'[ \t]', encoding='GBK',
   engine='python', names=range(5))
tdelaney
  • 73,364
  • 6
  • 83
  • 116