
I have some csv files I want to read, which for whatever reason are formatted like this:

A B C
1 3 1
2 2 2
3 1 3


D
1
2
3

The problem here is that column D sits below the other columns, and this makes Pandas very unhappy: once it finishes reading the rows of A, B and C, it runs straight into D's column name string.

I can of course read it like

pd.read_csv(file, skiprows=1, nrows=rows_in_A_B_C)

Basically, nrows = length_of_A_B_C. Problem is, I don't know the number of rows before D, and I can't read the csv until I do.

How can I solve this? Can I stop reading rows based on a condition instead, such as when I hit the header for D?
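The only workaround I can think of so far is to scan the file myself for D's header and count the rows above it, roughly like the sketch below (it assumes D's block starts with a line containing nothing but "D", which might not hold for every file), but I was hoping Pandas had something built in for this:

import pandas as pd

file = "data.csv"  # placeholder path for one of the csv files described above

def rows_before_d(path):
    # Count the data rows of the A/B/C block by scanning the raw lines.
    with open(path) as fh:
        lines = fh.readlines()
    # Index of the line holding only D's header (an assumption about the format).
    d_header = next(i for i, line in enumerate(lines) if line.strip() == "D")
    # Everything between the A/B/C header (line 0) and D's header, minus blanks.
    return sum(1 for line in lines[1:d_header] if line.strip())

df = pd.read_csv(file, nrows=rows_before_d(file))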

komodovaran_

1 Answer


A possible answer was already posted in the comments to the original post, but I still felt those suggestions were needlessly hard to think up on the fly for a rather simple task (or maybe I'm just bad, ha). In my case, I figured the best method was the following:

df = pd.read_csv(file, dtype=str, names=["A", "B", "C"], skiprows=1)

Now Pandas will fill the missing fields in the bottom rows with NaN, and this happens to mark exactly the rows that column D is contained in. All these rows can then be thrown away:

df = df[df["A"].str.contains("NaN") == False]

And because we need it as a numeric dataframe:

df = df.apply(pd.to_numeric)

The length of what is left now tells us how many rows to skip when parsing D on its own (plus one for the A/B/C header line; Pandas skips the blank lines by itself):

D_only = pd.read_csv(file, skiprows=len(df) + 1)

And the two can be concatenated with pd.concat([df, D_only], axis=1).

Disclaimer: I don't know how efficient this is, computationally.
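For completeness, here is the whole thing in one place, roughly as I would write it now. It is only a sketch against the sample in the question: the file name is a placeholder, and it assumes the files really are comma separated (pass sep= otherwise).

import pandas as pd

file = "data.csv"  # placeholder path

# Read only the A/B/C block: skip its header line, keep everything as str,
# and let Pandas pad the short D-block lines with NaN.
df = pd.read_csv(file, dtype=str, names=["A", "B", "C"], skiprows=1)

# Rows containing NaN belong to the D block (including its header), so drop
# them and convert what remains back to numbers.
df = df.dropna().apply(pd.to_numeric)

# Skip the A/B/C header plus its data rows; Pandas skips the blank lines on
# its own, so "D" becomes the header of this second read.
D_only = pd.read_csv(file, skiprows=len(df) + 1)

# Glue the two blocks together side by side.
combined = pd.concat([df, D_only], axis=1)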

komodovaran_