Numpy Separating CSV into columns

Question

I'm trying to use a CSV imported from bballreference.com. But as you can see, the separated values are all in one row rather than separated by columns. On NumPy Pandas, what would be the easiest way to fix this? I've googled to no avail.

csv on jupyter

I don't know how to post CSV file in a clean way but here it is:

",,,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Shooting,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Per Game,Per Game"
"Rk,Player,Age,G,GS,MP,FG,FGA,3P,3PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,FG%,3P%,FT%,MP,PTS,TRB,AST,STL,BLK"
"1,Kevin Durant\duranke01,29,5,5,182,54,107,9,28,22,27,3,34,37,24,7,6,10,7,139,.505,.321,.815,36.5,27.8,7.4,4.8,1.4,1.2"
"2,Klay Thompson\thompkl01,27,5,5,183,38,99,12,43,11,11,3,29,32,9,1,2,6,11,99,.384,.279,1.000,36.7,19.8,6.4,1.8,0.2,0.4"
"3,Stephen Curry\curryst01,29,4,3,125,32,67,15,34,19,19,2,19,21,14,8,2,15,6,98,.478,.441,1.000,31.2,24.5,5.3,3.5,2.0,0.5"
"4,Draymond Green\greendr01,27,5,5,186,27,55,8,20,12,15,12,47,59,50,12,8,18,16,74,.491,.400,.800,37.1,14.8,11.8,10.0,2.4,1.6"
"5,Andre Iguodala\iguodan01,34,5,4,140,14,29,4,12,7,12,4,21,25,17,10,2,3,7,39,.483,.333,.583,27.9,7.8,5.0,3.4,2.0,0.4"
"6,Quinn Cook\cookqu01,24,4,0,58,12,27,0,10,6,8,1,8,9,4,1,0,2,4,30,.444,.000,.750,14.4,7.5,2.3,1.0,0.3,0.0"
"7,Kevon Looney\looneke01,21,5,0,113,12,17,0,0,4,8,10,19,29,5,4,1,2,17,28,.706,,.500,22.6,5.6,5.8,1.0,0.8,0.2"
"8,Shaun Livingston\livinsh01,32,5,0,79,11,27,0,0,4,4,0,6,6,12,0,1,3,9,26,.407,,1.000,15.9,5.2,1.2,2.4,0.0,0.2"
"9,David West\westda01,37,5,0,40,8,14,0,0,0,0,2,5,7,13,2,4,3,4,16,.571,,,7.9,3.2,1.4,2.6,0.4,0.8"
"10,Nick Young\youngni01,32,4,2,41,3,11,3,10,2,3,0,4,4,1,1,0,1,3,11,.273,.300,.667,10.2,2.8,1.0,0.3,0.3,0.0"
"11,JaVale McGee\mcgeeja01,30,3,1,19,3,8,0,1,0,0,4,2,6,0,0,1,0,2,6,.375,.000,,6.2,2.0,2.0,0.0,0.0,0.3"
"12,Zaza Pachulia\pachuza01,33,2,0,8,1,2,0,0,2,4,4,2,6,0,2,0,1,1,4,.500,,.500,4.2,2.0,3.0,0.0,1.0,0.0"
"13,Jordan Bell\belljo01,23,4,0,23,1,4,0,0,1,2,1,5,6,5,2,2,0,2,3,.250,,.500,5.8,0.8,1.5,1.3,0.5,0.5"
"14,Damian Jones\jonesda03,22,1,0,3,0,1,0,0,2,2,0,0,0,0,0,0,0,0,2,.000,,1.000,3.2,2.0,0.0,0.0,0.0,0.0"
",Team Totals,26.5,5,,1200,216,468,51,158,92,115,46,201,247,154,50,29,64,89,575,.462,.323,.800,240.0,115.0,49.4,30.8,10.0,5.8"

Welcome to SO. Please take the time to read [ask] and the other links on that page. [Please don't post images of code or data](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557) - copy the text (minimal amout), paste it into the question and [format it as code](https://stackoverflow.com/help/formatting). — wwii, May 30 '18 at 23:20
Also see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for advice on how to create a great pandas question. — jpp, May 30 '18 at 23:21

Logstar · Answer 1 · 2018-05-31T01:04:26.530

2

It seems that the first two rows of your CSV file are headers, but the default behavior of pd.read_csv thinks that only the first row is header.

Also, the beginning and trailing quotes make pd.read_csv think the text in between is a single field/column.

You could try the following:

Remove the beginning and trailing quotes, and

bbal = pd.read_csv('some_file.csv', header=[0, 1], delimiter=',')

Following is how you could use Python to remove the beginning and trailing quotes:

# open 'quotes.csv' in read mode with variable in_file as handle
# open 'no_quotes.csv' in write mode with variable out_file as handle
with open('quotes.csv') as in_file, open('no_quotes.csv', 'w') as out_file:
    # read in_file line by line
    # the variable line stores each line as string
    for line in in_file:
        # line[1:-1] slices the string to omit the first and last character
        # append a newline character '\n' to the sliced line
        # write the string with newline to out_file
        out_file.write(line[1:-1] + '\n')

# read_csv on 'no_quotes.csv'
bbal = pd.read_csv('no_quotes.csv', header=[0, 1], delimiter=',')
bbal.head()

edited May 31 '18 at 01:04

answered May 30 '18 at 23:48

Logstar

442
6
11

Thank you that fixed the header separation issue. But how would I use the separated comma values within each row and place them in individual columns? – NimbleTortoise May 30 '18 at 23:50
@NimbleTortoise Then it's not the header issue. Could you paste the full csv file content in the question? It would be very helpful to figure out how to separate values into individual columns. – Logstar May 30 '18 at 23:57
I just added it. Although it's a big block of text and there's probably a better way to post this. – NimbleTortoise May 31 '18 at 00:07
@NimbleTortoise Thanks. Could you try specifying delimiter=',' in pd.read_csv. I removed the beginning and tailing quotes and moved each row into a line. Then, I am able to read the text into individual columns without specifying delimiter=','. – Logstar May 31 '18 at 00:13
@NimbleTortoise I think removing the beginning and trailing quotes would fix your problem, if they are in your file. I thought they were introduced by copy-and-paste. – Logstar May 31 '18 at 00:20
Oh I removed all the "" and that does seem to put everything into columns. I wonder why they were there in the first place. Thanks for the big help. Do you think there's a shortcut to removing these for bigger files? – NimbleTortoise May 31 '18 at 00:42
@NimbleTortoise Thanks for trying that. If you have quotes at the beginning and the end of each line, adding delimiter=',' will not help, since non-empty text between a pair of quotes is considered as a single column in CSV format. – Logstar May 31 '18 at 00:48
@NimbleTortoise For larger file you could use Python read the file line-by-line into strings, slice the text between the beginning and trailing quotes, and write the sliced strings line-by-line into another file. I will write a snippet in my answer. – Logstar May 31 '18 at 00:52

score 0 · Answer 2 · answered May 31 '18 at 03:17

Consider reading in csv as a text file to be stripped of the beginning/end quotes per line on a text file read which tell the parser all data between is a singular value. And use built-in StringIO to read text string into dataframe instead of saving to disk for import.

Additionally, skip the first row of repeated Totals and Per Game and even the last row that aggregates since you can do that with pandas.

from io import StringIO
import pandas as pd

with open('BasketballCSVQuotes.csv') as f:
    csvdata = f.read().replace('"', '')

df = pd.read_csv(StringIO(csvdata), skiprows=1, skipfooter=1, engine='python')

print(df)

Output

      Rk                      Player   Age  G   GS    MP   FG  FGA  3P  3PA  ...    PTS    FG%    3P%    FT%   MP.1  PTS.1  TRB.1  AST.1  STL.1  BLK.1
0    1.0      Kevin Durant\duranke01  29.0  5  5.0   182   54  107   9   28  ...    139  0.505  0.321  0.815   36.5   27.8    7.4    4.8    1.4    1.2
1    2.0     Klay Thompson\thompkl01  27.0  5  5.0   183   38   99  12   43  ...     99  0.384  0.279  1.000   36.7   19.8    6.4    1.8    0.2    0.4
2    3.0     Stephen Curry\curryst01  29.0  4  3.0   125   32   67  15   34  ...     98  0.478  0.441  1.000   31.2   24.5    5.3    3.5    2.0    0.5
3    4.0    Draymond Green\greendr01  27.0  5  5.0   186   27   55   8   20  ...     74  0.491  0.400  0.800   37.1   14.8   11.8   10.0    2.4    1.6
4    5.0    Andre Iguodala\iguodan01  34.0  5  4.0   140   14   29   4   12  ...     39  0.483  0.333  0.583   27.9    7.8    5.0    3.4    2.0    0.4
5    6.0         Quinn Cook\cookqu01  24.0  4  0.0    58   12   27   0   10  ...     30  0.444  0.000  0.750   14.4    7.5    2.3    1.0    0.3    0.0
6    7.0      Kevon Looney\looneke01  21.0  5  0.0   113   12   17   0    0  ...     28  0.706    NaN  0.500   22.6    5.6    5.8    1.0    0.8    0.2
7    8.0  Shaun Livingston\livinsh01  32.0  5  0.0    79   11   27   0    0  ...     26  0.407    NaN  1.000   15.9    5.2    1.2    2.4    0.0    0.2
8    9.0         David West\westda01  37.0  5  0.0    40    8   14   0    0  ...     16  0.571    NaN    NaN    7.9    3.2    1.4    2.6    0.4    0.8
9   10.0        Nick Young\youngni01  32.0  4  2.0    41    3   11   3   10  ...     11  0.273  0.300  0.667   10.2    2.8    1.0    0.3    0.3    0.0
10  11.0      JaVale McGee\mcgeeja01  30.0  3  1.0    19    3    8   0    1  ...      6  0.375  0.000    NaN    6.2    2.0    2.0    0.0    0.0    0.3
11  12.0     Zaza Pachulia\pachuza01  33.0  2  0.0     8    1    2   0    0  ...      4  0.500    NaN  0.500    4.2    2.0    3.0    0.0    1.0    0.0
12  13.0           Jordan Belelljo01  23.0  4  0.0    23    1    4   0    0  ...      3  0.250    NaN  0.500    5.8    0.8    1.5    1.3    0.5    0.5
13  14.0      Damian Jones\jonesda03  22.0  1  0.0     3    0    1   0    0  ...      2  0.000    NaN  1.000    3.2    2.0    0.0    0.0    0.0    0.0

[14 rows x 30 columns]

Numpy Separating CSV into columns

2 Answers2