How to use Python Polars read_csv when column length increases after row 1?

Question

I have an example CSV with 1 column in row 1 and 2 columns in the other rows. The parser in Polars read_csv only recognizes 1 column. How do I force it to read more columns? I cannot simply use skiprows because sometimes more than the first row is a single column. I know Pandas can get around this with the names parameter, but I need to use Polars for speed. Any help would be appreciated.

CSV contents:

Data
Date,A
Time,B

Code:

import polars as pl
dumpdf = pl.read_csv('example.csv', has_header=False)
print(dumpdf)

Have you read the documentation? https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html — alec_djinn, Jul 09 '23 at 08:06
Hi, welcome to StackOverflow. Please take the [tour](https://stackoverflow.com/tour) and learn [How to Ask](https://stackoverflow.com/help/how-to-ask). In order to get help, you will need to provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). If your question includes a pandas dataframe, please provide a [reproducible pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — alec_djinn, Jul 09 '23 at 08:08
I'm not sure if there is anything you can do here as the data is "not csv". I think you would need a pre-processing step to add commas, or find the first line with a comma to use as a marker. — jqurious, Jul 09 '23 at 08:34

score 0 · Answer 1 · answered Jul 09 '23 at 08:55

As others have said, CSV files are rectangular: if one row has one column and another has 50, then it's not a valid CSV. That said, if you have a folder of malformed CSV files to load and you must use Polars, there are a few things you could do:

Edit the CSV files to make them valid. In this case, a reasonably fast option could be to read all of the rows into a list, add as many commas as needed so that each row has n columns (where n is the maximum number of columns in the CSV), and then save the CSV back to disk. Then read the CSV with Polars.
Assuming that the format of the CSV file is something like:

blah
blah
blah
Date,A
Time,B

you could loop through each line, find the first line with a comma, and then use its index as the skip_rows input to read_csv (Credit: jqurious)

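import polars as pl

path = "example.csv"

# Count the leading lines that contain no comma; these are the
# single-column rows to skip. Iterating over the file stops at EOF,
# so a file without any comma cannot cause an infinite loop.
skip = 0
with open(path, "r") as file:
    for line in file:
        if "," in line:
            break
        skip += 1

df = pl.read_csv(path, skip_rows=skip, has_header=False)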

Use a different library. I'm not aware of Polars being able to read a CSV file line by line, which is essentially what you're asking for. You could try switching to another library that has this as an option.
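
The first option above (padding short rows with commas) could be sketched roughly as follows. This is only an illustration, not a tested recipe for every file: pad_csv_text and read_ragged_csv are hypothetical helper names, the padding uses the standard-library csv module, blank lines are dropped, and the whole file is assumed to fit in memory. The padded text is handed to pl.read_csv as bytes, the same way Answer 2 passes bytes.

```python
import csv
import io


def pad_csv_text(text: str) -> str:
    """Return CSV text in which every row has the same number of fields."""
    # Parse with the csv module so quoted fields containing commas stay intact.
    # Blank lines come back as empty lists and are dropped here.
    rows = [row for row in csv.reader(io.StringIO(text, newline="")) if row]
    width = max((len(row) for row in rows), default=0)

    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Append empty fields until the row reaches the widest row's length.
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()


def read_ragged_csv(path: str):
    """Hypothetical helper: pad the file in memory, then pass it to Polars."""
    import polars as pl  # imported lazily; the padding step itself needs no Polars

    with open(path, newline="") as f:
        padded = pad_csv_text(f.read())
    return pl.read_csv(padded.encode(), has_header=False)
```

Calling read_ragged_csv("example.csv") would then replace the plain pl.read_csv call from the question. The padding pass runs in pure Python, so on very large files it may cost more than the Polars read itself.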

score 0 · Answer 2 · answered Jul 09 '23 at 10:51

Perhaps there are better ways, but the idea of a pre-processing step was something like:

import tempfile
import polars as pl

notcsv = tempfile.NamedTemporaryFile()
notcsv.write(b"""
Data
More
Data
Date,A
Time,B
Other,"foo
bar"
""".strip()
)
notcsv.seek(0)

def my_read_csv(filename):
    with open(filename, "rb") as f:
        lines = b""
        for line in f:
            if b"," in line:
                df = pl.concat([
                    pl.read_csv(lines, has_header=False),
                    pl.read_csv(b"".join((line, *(line for line in f))), has_header=False)
                ])
                return df
            else:
                # strip the line ending (works even if the last line has none), then pad with a comma
                lines += line.rstrip(b"\r\n") + b",\n"

>>> my_read_csv(notcsv.name)
shape: (6, 2)
┌──────────┬──────────┐
│ column_1 ┆ column_2 │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ Data     ┆ null     │
│ More     ┆ null     │
│ Data     ┆ null     │
│ Date     ┆ A        │
│ Time     ┆ B        │
│ Other    ┆ foo      │
│          ┆ bar      │
└──────────┴──────────┘

score 0 · Answer 3 · answered Jul 11 '23 at 13:38

You can set the separator parameter to the null character and then split the single resultant column yourself like this...

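(
    # chr(0) is the null character; it should not occur in the file,
    # so every line is read as a single string column
    pl.read_csv('./sostream/example.csv', has_header=False, separator=chr(0))
    .select(
        a=pl.col('column_1')
        .str.split(',')
        .list.to_struct(
            n_field_strategy='max_width',
            fields=lambda x: f"column_{x+1}",
        )
    )
    .unnest('a')
)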

shape: (3, 2)
┌──────────┬──────────┐
│ column_1 ┆ column_2 │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ Data     ┆ null     │
│ Date     ┆ A        │
│ Time     ┆ B        │
└──────────┴──────────┘
