2

I am working with the GDELT dataset am having issues creating a pandas DataFrame using pd.DataFrame.from_csv(path_to_data, sep=",") which seems to load the data fine except except for the fact that the first header column is shifted to row 1 like so:

enter image description here

The arrow indicates where Source should be. Here is a snippet of the raw data in CSV format:

Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
MEDIA,MED,,CA,MINIST,GOV,,CA,090,39
SUPREME COURT,JUD,,CA,DOCTOR,HLH,,CA,060,31
POLICE,COP,,CA,TORONTO,,,CA,173,31
PUBLISHER,MED,,CA,BUSINESS,BUS,,CA,010,29
HOSPITAL,HLH,,CA,POLICE,COP,,CA,043,28
HOSPITAL,HLH,,CA,TORONTO,,,CA,043,26
POLICE,COP,,CA,HOSPITAL,HLH,,CA,042,26
PRIME MINISTER,GOV,,CA,GERMANY,,,FR,042,22

Thanks!

Calvin

blong
  • 2,815
  • 8
  • 44
  • 110
cdlm
  • 565
  • 9
  • 20
  • Yes, I know. While I see that this is exactly the problem, I won't be able to actually test it until later. But I will go ahead and accept for now. Thanks for the response! – cdlm May 07 '15 at 19:49
  • I'm sure that this is the problem as the docs :http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html#pandas.DataFrame.from_csv show that the default value for `index_col` is `0` so it's treating your first col as the index, setting this to `None` or using `pd.read_csv` will work fine – EdChum May 07 '15 at 19:51
  • Indeed, that was the problem. – cdlm May 07 '15 at 21:03

1 Answers1

3

Don't use from_csv it's no longer maintained, use read_csv:

In [244]:

t="""Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
MEDIA,MED,,CA,MINIST,GOV,,CA,090,39
SUPREME COURT,JUD,,CA,DOCTOR,HLH,,CA,060,31
POLICE,COP,,CA,TORONTO,,,CA,173,31
PUBLISHER,MED,,CA,BUSINESS,BUS,,CA,010,29
HOSPITAL,HLH,,CA,POLICE,COP,,CA,043,28
HOSPITAL,HLH,,CA,TORONTO,,,CA,043,26
POLICE,COP,,CA,HOSPITAL,HLH,,CA,042,26
PRIME MINISTER,GOV,,CA,GERMANY,,,FR,042,22"""
df = pd.read_csv(io.StringIO(t))
df
Out[244]:
           Source Actor1Type1Code  Actor1Type2Code Actor1Geo_CountryCode  \
0          PRINCE             GOV              NaN                    CA   
1           MEDIA             MED              NaN                    CA   
2   SUPREME COURT             JUD              NaN                    CA   
3          POLICE             COP              NaN                    CA   
4       PUBLISHER             MED              NaN                    CA   
5        HOSPITAL             HLH              NaN                    CA   
6        HOSPITAL             HLH              NaN                    CA   
7          POLICE             COP              NaN                    CA   
8  PRIME MINISTER             GOV              NaN                    CA   

     Target Actor2Type1Code  Actor2Type2Code Actor2Geo_CountryCode  EventCode  \
0   CITIZEN             CVL              NaN                    CA         51   
1    MINIST             GOV              NaN                    CA         90   
2    DOCTOR             HLH              NaN                    CA         60   
3   TORONTO             NaN              NaN                    CA        173   
4  BUSINESS             BUS              NaN                    CA         10   
5    POLICE             COP              NaN                    CA         43   
6   TORONTO             NaN              NaN                    CA         43   
7  HOSPITAL             HLH              NaN                    CA         42   
8   GERMANY             NaN              NaN                    FR         42   

   f0_  
0   61  
1   39  
2   31  
3   31  
4   29  
5   28  
6   26  
7   26  
8   22  

Or pass param index_col=None:

df = pd.DataFrame.from_csv(io.StringIO(t), index_col=None)

so it doesn't interpret the first column as an index column

EdChum
  • 376,765
  • 198
  • 813
  • 562