I have just started delving into the world of Pandas, and the first strange CSV file I've found is one where there are two lines of comments (with different column widths) right at the beginning.

sometext, sometext2
moretext, moretext1, moretext2
*header*
actual data ---
---------------

I know how to skip these lines with skiprows or header=, but, instead, how would I retain these comments while using read_csv? Sometimes comments are necessary as file meta information, and I do not want to throw them away.

halfer
Coolio2654
  • Is there a file specification that states CSV files have comments or any metadata? Just read the two lines into a separate variable – OneCricketeer Feb 10 '18 at 18:08
  • Well, what you imported as raw data can always be kept. IIUC you might be better using `iloc[some_row:]` and creating a copy of the DF for the rest of your calculations. Not the most memory-efficient way but it depends on your specific problem. – roganjosh Feb 10 '18 at 18:10
  • @roganjosh Could you please elaborate more on `iloc[some_row:]` to extract the raw data? – Coolio2654 Feb 10 '18 at 18:16
  • @Coolio2654, if one of the solutions below helped, feel free to accept it (tick on the left). This will help other users with the same issue. – jpp Feb 11 '18 at 17:42

2 Answers

Pandas is designed to read structured data.

For unstructured data, just use the built-in open together with the csv module:

import csv

with open('file.csv') as f:
    reader = csv.reader(f)
    row1 = next(reader)  # first comment line, as a list of fields
    row2 = next(reader)  # second comment line, as a list of fields

You can attach strings to the dataframe like this:

df.comments = 'My Comments'

But note:

Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join or loc to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
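
Newer pandas releases (1.0 and later) also provide a `DataFrame.attrs` dictionary intended for this kind of metadata. The documentation marks it as experimental, so the following is only a minimal sketch. The sample text, the `header` column name, and the `comments` key are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents modelled on the question's sample
raw = "sometext, sometext2\nmoretext, moretext1, moretext2\nheader\n1\n2\n"

buf = io.StringIO(raw)  # stands in for open('file.csv')

# Consume the two comment lines, leaving the handle at the header row
comments = [buf.readline().rstrip("\n") for _ in range(2)]

# read_csv continues from the current position of the handle
df = pd.read_csv(buf)

# attrs is a plain dict stored on the DataFrame (experimental API)
df.attrs["comments"] = comments
print(df.attrs["comments"])
```

Whether `attrs` survives a given operation depends on the pandas version and the operation, so it is worth checking after each transformation you rely on.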

jpp
  • Ok, that is informative, I'll make sure I understand the basic file IO then. How could I re-import these extracted comments into my final pandas frame, then? Preferably at the top? – Coolio2654 Feb 10 '18 at 18:15
  • It's not clear what you mean. If the columns in the first 2 rows align, use `pd.read_csv` and don't skip them. If they don't align, how do you intend to "reimport into final dataframe"? For metadata, see [Adding meta-information/metadata to pandas DataFrame](https://stackoverflow.com/questions/14688306/adding-meta-information-metadata-to-pandas-dataframe). – jpp Feb 10 '18 at 18:17
  • I just want to somehow include these comments in my pandas object as explicitly comments, and not part of the regular data, occupying a special status like the column names. So now I was thinking that I could use your code A) to extract the comments, B) feed everything in the csv after the comments into pandas, C) append the comments somehow into the pandas object. – Coolio2654 Feb 10 '18 at 18:27
  • @Coolio2654, see my update, it's possible but with a massive disclaimer. – jpp Feb 10 '18 at 18:44
  • This seems to work as well as I can expect at this point, since I am asking about a non-orthodox feature for Pandas. My single last question is whether whatever is in `df.comments` will be included if the file is saved as a csv again. – Coolio2654 Feb 11 '18 at 00:44
  • No, when you save as csv the information is lost. Your only option here is to either pickle (serialised python format) or move to a specialist format such as HDF5, which allows you to save dataset attributes. But if the only reason for this is storing comments it seems overkill. – jpp Feb 11 '18 at 00:52
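
If the goal is only to get the comment lines back into the saved file, one workaround is to write them to the file handle yourself and then let `to_csv` append the table to the same handle. This is a rough sketch, not a pandas feature; the `comments` list and the example frame are hypothetical stand-ins for the values extracted earlier.

```python
import io

import pandas as pd

# Hypothetical comment lines and data, standing in for the real file
comments = ["sometext, sometext2", "moretext, moretext1, moretext2"]
df = pd.DataFrame({"header": [1, 2]})

out = io.StringIO()  # stands in for open('out.csv', 'w', newline='')

# Write the comment lines first, exactly as they were read
for line in comments:
    out.write(line + "\n")

# to_csv accepts an open handle and writes from its current position
df.to_csv(out, index=False)

result = out.getvalue()
print(result)
```

The resulting file has the same shape as the original, so it can be read back by consuming the first two lines before calling `read_csv`.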
You can read the metadata lines first and then pass the same open file to read_csv, which continues from where the reads stopped:

import pandas as pd

with open('f.csv') as file:
    # read the first 2 rows as metadata
    header = [file.readline() for _ in range(2)]
    meta = [value.strip().split(',') for value in header]
    print(meta)
    # [['sometext', ' sometext2'], ['moretext', ' moretext1', ' moretext2']]

    # the handle now points at the header row, so read_csv parses the rest
    df = pd.read_csv(file)
    print(df)
    #       *header*
    # 0  actual data

jezrael
  • While I chose the other answer as the final one, because jpp made it clear that it is *definitively* impossible to include comment lines in pandas and showed me a temporary solution via `df.comments`, this answer helped me extract those comments in the first place. Thanks, jezrael. – Coolio2654 Feb 12 '18 at 01:21