0

I have a .csv file which I want to open and ultimately save it as a pandas dataframe. This file has some junk text above the data frame per se, whose header starts at the string Sample_ID. I wrote a code which does the job in multiple steps, and I am now wondering if there's a more elegant way to do so. Here's my code

import pandas as pd
import re
from io import StringIO

with open('SampleSheet.csv') as f:
    ## read in the .csv file as a string
    step1 = f.read()

    ## subset the step1 file
    # define where my df should start
    start = 'Sample_ID'
    step2 = step1[step1.index(start):]

    ## read in step2 as a pandas dataframe with stringio
    step3 = pd.read_csv(StringIO(step2))

I was wondering if there's a way to slice directly with f.read(), such that I would have one step less already.

I also tried to use pd.read_csv() with skiprows, but I am having a hard time in assigning the row number which starts with Sample_ID

BCArg
  • 2,094
  • 2
  • 19
  • 37

1 Answers1

-1

You can import and read in the file using only read_csv() as follows:

df  =   pd.read_csv('SampleSheet.csv', header=3)

where header is the number of lines you want to skip at the top of the file before your data set starts.

Ch3mZ_ZA
  • 7
  • 1
  • The thing is that the number of junk lines vary from the different files that I have, that's why I have to look for the pattern `Sample_ID` – BCArg Dec 21 '18 at 13:09
  • None of them will do the job, as the amount of junk lines (which affects both `header` and `skiprows`) in all the `SampleSheet.csv` that I want to read in vary – BCArg Dec 21 '18 at 13:12