
I have an example CSV with 1 column in row 1 and 2 columns in the other rows. Polars' read_csv only recognizes 1 column. How do I force it to read more columns? I cannot simply skip the first row, because sometimes more than just the first row has a single column. I know Pandas can get around this with the names parameter, but I need to use Polars for speed. Any help would be appreciated.

CSV contents:

Data
Date,A
Time,B

Code:

import polars as pl
dumpdf = pl.read_csv('example.csv', has_header=False)
print(dumpdf)

Current and desired output

Dogbert
Josh Y.
  • Have you read the documentation? https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html – alec_djinn Jul 09 '23 at 08:06
  • Hi, welcome to StackOverflow. Please take the [tour](https://stackoverflow.com/tour) and learn [How to Ask](https://stackoverflow.com/help/how-to-ask). In order to get help, you will need to provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). If your question includes a pandas dataframe, please provide a [reproducible pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – alec_djinn Jul 09 '23 at 08:08
  • I'm not sure if there is anything you can do here as the data is "not csv". I think you would need a pre-processing step to add commas, or find the first line with a comma to use as a marker. – jqurious Jul 09 '23 at 08:34

3 Answers


As others have said, CSV files are rectangular: if one row has one column and another has 50, it is not a valid CSV. That said, if you have a folder of malformed CSV files to load and you must use Polars, there are a few things you could do:

  • edit the CSV files to make them valid. A reasonably fast option is to read all of the rows into a list, append as many commas as needed to give each row n columns (where n is the maximum number of columns in the file), and save the result back to disk. Then read the CSV with Polars
  • assuming that the format of the csv file is something like:
blah
blah
blah
Date,A
Time,B

you could loop through the lines, find the first line with a comma, and pass the number of lines before it to read_csv as skip_rows (Credit: jqurious)

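import polars as pl

path = "example.csv"
with open(path, 'r') as file:
    # Count the leading lines that contain no comma.
    # Iterating over the file also stops at EOF, so this cannot loop forever.
    skip = 0
    for line in file:
        if ',' in line:
            break
        skip += 1

# Start parsing at the first line that contains a comma.
df = pl.read_csv(path, skip_rows=skip, has_header=False)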
  • use a different library. I'm not aware of a way for Polars to read a CSV file line by line, which is essentially what you're asking for. You could try another library that offers this as an option
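
The first option above (padding the short rows) could be sketched roughly as follows. This is only an illustration using the standard-library csv module, not a tested recipe for every file: the helper name pad_csv_rows and the separate output path are made up for the example, and it assumes the file fits in memory.

```python
import csv


def pad_csv_rows(src_path, dst_path):
    """Rewrite src_path so every row has as many fields as the widest row.

    Hypothetical helper for illustration; loads the whole file into memory.
    Returns the number of columns in the padded file.
    """
    with open(src_path, newline="") as src:
        rows = list(csv.reader(src))

    # Width of the widest row; 0 if the file is empty.
    width = max((len(row) for row in rows), default=0)

    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in rows:
            # Pad short rows with empty fields so the file is rectangular.
            writer.writerow(row + [""] * (width - len(row)))
    return width
```

The padded file should then load with pl.read_csv(dst_path, has_header=False), with the padded cells expected to come through as nulls under the default settings.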
Mark

Perhaps there are better ways, but the idea of a pre-processing step was something like:

import tempfile
import polars as pl

notcsv = tempfile.NamedTemporaryFile()
notcsv.write(b"""
Data
More
Data
Date,A
Time,B
Other,"foo
bar"
""".strip()
)
notcsv.seek(0)

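def my_read_csv(filename):
    with open(filename, "rb") as f:
        lines = b""
        for line in f:
            if b"," in line:
                # First line with a comma: let read_csv parse it and the rest of the file.
                rest = pl.read_csv(line + f.read(), has_header=False)
                if not lines:
                    # No single-column prefix, so there is nothing to concatenate.
                    return rest
                # Stack the comma-padded prefix on top of the rest.
                return pl.concat([pl.read_csv(lines, has_header=False), rest])
            else:
                # Single-column line: add a trailing comma so it parses as two columns.
                lines += line.rstrip(b"\r\n") + b",\n"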
>>> my_read_csv(notcsv.name)
shape: (6, 2)
┌──────────┬──────────┐
│ column_1 ┆ column_2 │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ Data     ┆ null     │
│ More     ┆ null     │
│ Data     ┆ null     │
│ Date     ┆ A        │
│ Time     ┆ B        │
│ Other    ┆ foo      │
│          ┆ bar      │
└──────────┴──────────┘ 
jqurious

You can set the separator parameter to the null character, so that each line is read as a single column, and then split that column yourself like this:

(
    pl.read_csv('./sostream/example.csv', has_header=False, separator=chr(0))
    .select(
        a=pl.col('column_1')
        .str.split(',')
        .list.to_struct(
            n_field_strategy='max_width',
            fields=lambda x: f"column_{x+1}"
        )
    )
    .unnest('a')
)

shape: (3, 2)
┌──────────┬──────────┐
│ column_1 ┆ column_2 │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ Data     ┆ null     │
│ Date     ┆ A        │
│ Time     ┆ B        │
└──────────┴──────────┘
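
One thing to keep in mind when splitting on ',' yourself: unlike a CSV parser, a plain split knows nothing about quoted fields, so a value such as "foo,bar" would be cut in two. A small standard-library illustration (the sample line is invented for the example):

```python
import csv

# Hypothetical sample line with a comma inside a quoted field.
line = 'Other,"foo,bar"'

# A plain split treats every comma as a delimiter, quoted or not.
naive = line.split(",")

# csv.reader honours the quotes and keeps the field intact.
parsed = next(csv.reader([line]))

print(naive)   # ['Other', '"foo', 'bar"']
print(parsed)  # ['Other', 'foo,bar']
```

If the files never contain quoted commas, this difference does not matter.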
Dean MacGregor