I'm trying to parse the following file:

student_id, 521, 597, 624, 100,
1, 99, 73, 97, 98,
2, 98, 71, 70, 99,

I have the following code:

def load_students(filename):
    exercises = []
    students = []
    grades = []
    fr = None
    try:
        fr = open(filename, 'r')
        for line in fr:
            tokens = line.strip('\n').split(',')

            # Get Exercises
            # Need help here

            # Get Students
            if tokens[0].isdigit():
                students.append(tokens[0])

            # Get grades
            grades.append([int(x) for x in tokens[1:]])
    except IOError:
        print("IO Error!")

    finally:
        if fr is not None:
            fr.close()
            print(exercises)
            print(students)
            print(grades)
        return np.array(exercises), np.array(students), np.array(grades)

How can I get the file header (521, 597, 624, 100) as an array, excluding the student_id string?

Akshat Zala
TheUnreal
  • Since you are already using Numpy, did you try just [using Numpy functionality to read the CSV file](https://stackoverflow.com/questions/3518778/how-do-i-read-csv-data-into-a-record-array-in-numpy)? – Karl Knechtel Jun 28 '20 at 07:35
  • Whatever wrote that file was not CSV conformant. It should not have spaces after the commas. Those may be treated as valid column values by CSV parsers. – tdelaney Jun 28 '20 at 07:43
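
To illustrate the point about spaces after commas, here is a small sketch using the standard-library csv module on the first two lines of the file from the question. The in-memory string stands in for the real file, and the variable names are only illustrative.

```python
import csv
import io

# Stand-in for the first two lines of the file in the question.
raw = "student_id, 521, 597, 624, 100,\n1, 99, 73, 97, 98,\n"

# Default settings: the space after each comma stays part of the value.
plain = list(csv.reader(io.StringIO(raw)))
print(plain[0])

# skipinitialspace=True ignores whitespace right after each delimiter.
clean = list(csv.reader(io.StringIO(raw), skipinitialspace=True))
print(clean[0])
```

In both cases the trailing comma still produces an empty final field, so that has to be dropped separately.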

4 Answers


Code:

def load_students(filename):
    exercises = []
    students = []
    grades = []
    fr = None
    try:
        fr = open(filename, 'r')
        for line in fr:
            # Strip each token and drop empty ones (e.g. after the trailing comma)
            tokens = [val.strip() for val in line.strip('\n').split(',') if val.strip()]
            if not tokens:
                continue  # skip blank lines

            if tokens[0].isdigit():
                # Data row: student id followed by grades
                students.append(tokens[0])
                grades.append([int(x) for x in tokens[1:]])
            else:
                # Header row: exercise ids follow the 'student_id' label
                exercises += [int(x) for x in tokens[1:]]
    except IOError:
        print("IO Error!")

    finally:
        if fr is not None:
            fr.close()
            print(exercises)
            print(students)
            print(grades)


load_students("data.csv")

Output:

[521, 597, 624, 100]
['1', '2']
[[99, 73, 97, 98], [98, 71, 70, 99]]

Explanation:

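The list comprehension [val.strip() for val in line.strip('\n').split(',') if val.strip()] strips the whitespace around each token and drops empty tokens, such as the one left by the trailing comma.

I also used the same check you already had to tell the rows apart: if the first token is not numeric, the line is the header and its remaining tokens are the exercise numbers.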
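
For reference, here is a minimal sketch of the same approach that returns the three lists instead of printing them and lets a with block close the file. The helper name parse_rows is hypothetical, and the sketch assumes the header is the only row whose first token is not a number.

```python
def parse_rows(lines):
    """Split header and data rows; `lines` is any iterable of text lines."""
    exercises, students, grades = [], [], []
    for line in lines:
        # Strip each token and drop empty ones (e.g. after a trailing comma).
        tokens = [tok.strip() for tok in line.split(',') if tok.strip()]
        if not tokens:
            continue  # blank line
        if tokens[0].isdigit():
            # Data row: student id, then grades.
            students.append(tokens[0])
            grades.append([int(tok) for tok in tokens[1:]])
        else:
            # Header row: everything after the first label is an exercise id.
            exercises = [int(tok) for tok in tokens[1:]]
    return exercises, students, grades


def load_students(filename):
    # The with block closes the file even if parsing raises.
    with open(filename, 'r') as fr:
        return parse_rows(fr)


# In-memory stand-in for the file from the question.
sample = [
    "student_id, 521, 597, 624, 100,\n",
    "1, 99, 73, 97, 98,\n",
    "2, 98, 71, 70, 99,\n",
]
exercises, students, grades = parse_rows(sample)
print(exercises)
print(students)
print(grades)
```

If NumPy arrays are needed, as in the question, each returned list can be wrapped in np.array by the caller.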

arshovon

In terms of slotting into your existing code, you could add an else clause to your if tokens[0].isdigit(): check:

    for line in fr:
        tokens = line.strip('\n').split(',')

        if tokens[0].isdigit():
            # Get Students
            students.append(tokens[0])
            # Get grades
            grades.append([int(x) for x in tokens[1:] if x.strip().isdigit()])
        else:
            exercises = [int(x) for x in tokens[1:] if x.strip().isdigit()]

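If you don't need the exercise values to be integers, you can keep the raw tokens (these still contain the leading spaces and the empty string left by the trailing comma):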

exercises = tokens[1:]

Also, if there might be other random data in the file, you could make the else be

elif tokens[0] == 'student_id':
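
Put together, the loop with the stricter elif guard might look like the sketch below. It runs on an in-memory list of lines rather than a file, and the first line of the sample is made up purely to show that unrecognised rows are skipped.

```python
lines = [
    "# made-up junk line, not part of the original file",
    "student_id, 521, 597, 624, 100,",
    "1, 99, 73, 97, 98,",
    "2, 98, 71, 70, 99,",
]

exercises, students, grades = [], [], []
for line in lines:
    tokens = line.strip('\n').split(',')
    first = tokens[0].strip()
    if first.isdigit():
        # Data row: student id, then grades (non-numeric tokens are dropped).
        students.append(first)
        grades.append([int(x) for x in tokens[1:] if x.strip().isdigit()])
    elif first == 'student_id':
        # Header row: exercise ids.
        exercises = [int(x) for x in tokens[1:] if x.strip().isdigit()]
    # Any other line falls through and is ignored.

print(exercises, students, grades)
```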
Nick
  • But I need all of them: `students`, `grades`, and `exercises`. The `exercises` are the IDs in the header only, while `grades` start from the second line. – TheUnreal Jun 28 '20 at 07:37
  • @TheUnreal that's what this will do; if the first entry on the line is a digit it will grab the student and grades, otherwise (presumably first line only) it will grab the exercises – Nick Jun 28 '20 at 07:40
  • Thanks, not sure why but the grades are missing the last character of the last element (showing `9` instead of `98`) – TheUnreal Jun 28 '20 at 09:07
  • @TheUnreal that is weird; if I just process a string line as `'1, 99, 73, 97, 98, '` it gets the expected result of `[[99, 73, 97, 98]]`. Note I have changed the answer slightly to deal with the trailing `, ` in the line. See https://rextester.com/EPRB39947 – Nick Jun 28 '20 at 09:17
  • Not sure why it's not working from the CSV, `[[99 73 97 9] [98 71 70 9]]` – TheUnreal Jun 28 '20 at 09:29
  • Can you verify what's in `line`? – Nick Jun 28 '20 at 09:50
  • It's weird that the answer you've accepted works - it's essentially exactly the same code. Anyway, I'm glad you've got a working solution so let's leave it at that. – Nick Jun 28 '20 at 11:53

Is this what you want?

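import pandas as pd

def load_students(filename):
    df = pd.read_csv(filename)  # use the argument rather than a hard-coded path
    # The trailing comma on every line creates an empty last column; drop it.
    df.drop(columns=df.columns[-1], inplace=True)
    # Header names keep the space after each comma, so strip them.
    df.columns = [col.strip() for col in df.columns]
    exercises = df.columns[1:].to_numpy()
    students = df.student_id.to_numpy()
    grades = df.iloc[:, 1:].to_numpy()
    return exercises, students, grades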
    
print(load_students('data.csv'))

Output:

(array(['521', '597', '624', '100'], dtype=object), array([1, 2]), array([[99, 73, 97, 98],
       [98, 71, 70, 99]]))
Balaji Ambresh

To process CSV files in Python, I would highly recommend using pandas.

Here is an example. Your file, slightly modified (spaces removed from the header and trailing commas removed from each line):

student_id,521,597,624,100
1, 99, 73, 97, 98
2, 98, 71, 70, 99

Code:

import pandas as pd
df = pd.read_csv(filename, index_col=0) # student_id becomes an index now
df.keys()

df.keys() returns the column labels (a pandas Index). Because student_id is used as the index, those labels are exactly the exercise IDs you want. Other things become much simpler too: for example, df['521'].values gives you a NumPy array with the values of that column.
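
If editing the file is not an option, the following sketch reads the original layout as-is. It assumes pandas is installed, uses an in-memory string as a stand-in for the real file, and relies on skipinitialspace=True to ignore the space after each comma; the all-empty column produced by the trailing commas is dropped afterwards.

```python
import io
import pandas as pd

# Stand-in for the original, unmodified file (spaces and trailing commas kept).
raw = (
    "student_id, 521, 597, 624, 100,\n"
    "1, 99, 73, 97, 98,\n"
    "2, 98, 71, 70, 99,\n"
)

df = pd.read_csv(io.StringIO(raw), index_col=0, skipinitialspace=True)
# The trailing comma on each line yields one column that is entirely empty.
df = df.dropna(axis=1, how='all')

exercises = df.columns.astype(int).to_numpy()  # header values as integers
students = df.index.to_numpy()                 # student_id column
grades = df.to_numpy()                         # remaining values as a 2-D array
print(exercises, students, grades, sep='\n')
```

With a real file, io.StringIO(raw) would be replaced by the file name.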

Tinu