
I have several CSV files containing data, each of which may begin with a different number of comment lines:

table_doi: 10.17182/hepdata.52402.v1/t7
name: Table 7
...
ABS(YRAP), < 0.1
SQRT(S) [GeV], 1960
PT [GEV], PT [GEV] LOW, PT [GEV] HIGH, D2(SIG)/DYRAP/DPT [NB/GEV]
67, 62, 72, 6.68
...
613.5, 527, 700, 1.81E-07

I would like to read in only the relevant data and their headers as well, which start from the line

PT [GEV], PT [GEV] LOW, PT [GEV] HIGH, D2(SIG)/DYRAP/DPT [NB/GEV]

Therefore the strategy I would think of is to find the pattern PT [GEV] and start reading from there.

However, I am not sure how to achieve this in Python. Could anyone help me with that?

Thank you in advance!


By the way, the function I currently have is

import os
import glob
import csv

def read_multicolumn_csv_files_into_dictionary(folderpath, dictionary):
    filepath = folderpath + '*.csv'
    files = sorted(glob.glob(filepath))
    for file in files:
        data_set = file.replace(folderpath, '').replace('.csv', '')
        dictionary[data_set] = {}
        with open(file, 'r') as data_file:
            data_pipe = csv.DictReader(data_file)
            dictionary[data_set]['pt'] = []
            dictionary[data_set]['sigma'] = []
            for row in data_pipe:
                dictionary[data_set]['pt'].append(float(row['PT [GEV]']))
                dictionary[data_set]['sigma'].append(float(row['D2(SIG)/DYRAP/DPT [NB/GEV]']))
    return dictionary

which only works if I manually delete those initial comments in the csv files.

zyy

5 Answers


Assuming every file has a line that starts with PT [GEV]:

import pandas as pd

...
csvs = []
for file in files:
    with open(file) as f:
        # Find the index of the header line, then let pandas skip everything before it.
        for i, line in enumerate(f):
            if line.startswith('PT [GEV]'):
                csvs.append(pd.read_csv(file, skiprows=i))
                break
df = pd.concat(csvs)
Chris

Check out startswith. You can also find a detailed explanation here: https://cmdlinetips.com/2018/01/3-ways-to-read-a-file-and-skip-initial-comments-in-python/

Ruchit Dalwadi
  • I have read that webpage before, however, what I am trying to do is to find `PT [GEV]` and read everything beyond that point, while the lines beyond that may not necessarily start with `PT [GEV]`. – zyy Jan 15 '19 at 14:19

Try this. It searches for the row that contains PT [GEV]; once it finds that row, it sets m to True and appends that row and every row after it to the list:

import csv

contain = 'PT [GEV]'
List = []
m = False
with open('Users.csv', 'rt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        # Flip the flag once a field in the row matches the header name.
        if contain in row:
            m = True
        # From the header row onwards, keep every row.
        if m:
            List.append(row)
I_Al-thamary
  • That worked! It would return a list of list, how do you think I could pass it to `csv.DictReader`? Or do you have a better idea manipulating it? – zyy Jan 16 '19 at 02:57
  • You can easily convert the list to a dictionary; see https://stackoverflow.com/questions/6900955/python-convert-list-to-dictionary – I_Al-thamary Jan 16 '19 at 03:11
  • Thanks, I will try that, sorry for late response. – zyy Jan 24 '19 at 16:59
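
On the question in the comments about getting from a list of lists to something like what csv.DictReader yields: a minimal sketch, assuming the first collected row is the header and that all fields are strings. The name rows_to_dicts and the sample rows are hypothetical, made up for illustration.

```python
def rows_to_dicts(rows):
    """Turn [header, record, record, ...] into a list of dicts keyed by header."""
    # Drop empty rows, which csv.reader yields for blank lines.
    rows = [row for row in rows if row]
    # The first remaining row is assumed to be the header.
    header, *records = rows
    # Strip the blank that follows each comma in the sample file.
    header = [name.strip() for name in header]
    return [dict(zip(header, (value.strip() for value in record)))
            for record in records]


# Hypothetical rows, shaped like what csv.reader returns for the sample file.
collected = [
    ["PT [GEV]", " PT [GEV] LOW", " PT [GEV] HIGH", " D2(SIG)/DYRAP/DPT [NB/GEV]"],
    ["67", " 62", " 72", " 6.68"],
    [],
    ["613.5", " 527", " 700", " 1.81E-07"],
]
records = rows_to_dicts(collected)
pt = [float(record["PT [GEV]"]) for record in records]
```

Each dict can then be indexed by column name, the same way the rows from csv.DictReader are used in the question. An empty input list would raise a ValueError at the unpacking step, so a real version may want to check for that first.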

You can use the file.tell method to save the file pointer position while you read and skip the lines until you find the header line, at which point you can use the file.seek method to reset the file pointer back to the beginning of the header line so that csv.DictReader can parse the rest of the file as valid CSV:

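with open(file, 'r') as data_file:
    while True:
        position = data_file.tell()
        # readline() rather than next(): iterating a text file disables tell() in Python 3
        line = data_file.readline()
        if not line: # reached the end of the file without finding a header
            break
        if line.count(',') == 3: # or whatever condition your header line satisfies
            data_file.seek(position) # reset file pointer to the beginning of the header line
            break
    data_pipe = csv.DictReader(data_file)
    ...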
blhsing
  • That is a nice way of doing it! I have 20 entries for actual data, so I write `if line.count(',') == 19:` but I am getting back empty dictionary. So I tried to print `line` before the `if` statement and it does seem like the `while` loop is terminated once the program found the actual header line. Do you think the problem comes from passing `data_file` to `data_pipe`? – zyy Jan 16 '19 at 02:43
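
If the file-pointer bookkeeping keeps causing trouble, one alternative is to drop the leading lines with itertools.dropwhile and hand what remains to csv.DictReader. This is only a sketch, assuming the header line starts with PT [GEV]; the name read_after_header, the in-memory sample, and the use of skipinitialspace=True (to strip the blank after each comma, as in the sample) are assumptions, not part of the answer above.

```python
import csv
import io
from itertools import dropwhile


def read_after_header(lines, marker="PT [GEV]"):
    """Return the records after the first line starting with `marker` as dicts.

    `lines` is any iterable of text lines, for example an open file object.
    """
    # Discard the leading comment lines; the header line itself is kept.
    remaining = dropwhile(lambda line: not line.startswith(marker), lines)
    # skipinitialspace strips the blank that follows each comma in the sample.
    reader = csv.DictReader(remaining, skipinitialspace=True)
    return list(reader)


# Hypothetical sample mirroring the layout shown in the question.
sample = io.StringIO(
    "table_doi: 10.17182/hepdata.52402.v1/t7\n"
    "name: Table 7\n"
    "ABS(YRAP), < 0.1\n"
    "SQRT(S) [GeV], 1960\n"
    "PT [GEV], PT [GEV] LOW, PT [GEV] HIGH, D2(SIG)/DYRAP/DPT [NB/GEV]\n"
    "67, 62, 72, 6.68\n"
    "613.5, 527, 700, 1.81E-07\n"
)
rows = read_after_header(sample)
pt = [float(row["PT [GEV]"]) for row in rows]
sigma = [float(row["D2(SIG)/DYRAP/DPT [NB/GEV]"]) for row in rows]
```

With a real file, the StringIO object would be replaced by the handle from open(file, newline=''). Nothing is rewound here, so tell and seek are not needed at all.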

I would just create a helper function to advance your csv reader to the first record:

import csv

def remove_comments_from_file():

    file_name = "super_secret_file.csv"
    file = open(file_name, 'r', newline='')

    csv_read_file = csv.reader(file)

    for row in csv_read_file:
        # Guard against blank lines, which csv.reader yields as empty lists.
        if row and row[0] == "PT [GEV]":
            break

    return csv_read_file

Something along those lines. When the csv reader is returned, it will start at your first record (in this example, 67, 62, 72, 6.68).

Meeko
  • Thank you for your answer, but how do I work with the returned `csv_read_file`? I tried to pass it to the function but it was recognized as a `NoneType`, then I tried to print it, but the only thing shown is `<_csv.reader object at 0x1190d44b0>`. – zyy Jan 16 '19 at 02:20
  • By the way, there are some empty lines that could cause problems, so I added a statement `if row != []:` that deals with this. – zyy Jan 16 '19 at 02:21
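
Regarding the comments above: a csv reader is a one-shot iterator, so printing it only shows the object; its rows come out when you loop over it or pass it to list(). A small sketch of how the returned reader might be consumed, assuming every remaining field is numeric. The name skip_to_records and the in-memory sample are hypothetical, and the function takes an already open file object rather than a hard-coded file name.

```python
import csv
import io


def skip_to_records(file_obj, first_field="PT [GEV]"):
    """Advance a csv reader past the header row and return it."""
    reader = csv.reader(file_obj, skipinitialspace=True)
    for row in reader:
        # Blank lines come back as empty lists, so test `row` before indexing.
        if row and row[0] == first_field:
            break
    return reader


# Hypothetical sample with a blank line between the data rows.
sample = io.StringIO(
    "table_doi: 10.17182/hepdata.52402.v1/t7\n"
    "name: Table 7\n"
    "SQRT(S) [GeV], 1960\n"
    "PT [GEV], PT [GEV] LOW, PT [GEV] HIGH, D2(SIG)/DYRAP/DPT [NB/GEV]\n"
    "67, 62, 72, 6.68\n"
    "\n"
    "613.5, 527, 700, 1.81E-07\n"
)
reader = skip_to_records(sample)
# Iterating continues from the first record; empty rows are filtered out.
data = [[float(value) for value in row] for row in reader if row]
```

Once the reader has been looped over it is exhausted, so the rows should be stored (as in data here) if they are needed more than once.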