
Updated Solution:

I have data with '|' separating some of the columns, i.e. it is not strictly csv. I have imported it as csv and am attempting to strip the extra '|' from the values in specific columns. My data follows:

import pandas as pd
from io import StringIO

dfy = pd.read_csv('Thesis/CRSP/CampaignFin14/pacs14.txt', header=0)

# Replace '|' in cells with Series.str methods
for col in dfy:
    if dfy[col].dtype == 'object':
        dfy[col] = dfy[col].str.replace('|', '')

dfy.head()



  |2014|  |4111920141231643319|  |C00206136|  |N00029285|  1000  05/15/2014  \
0   2014  |4021120141205164809|  |C00307397|  |N00026722|  5000  10/22/2013   
1   2014  |4053020141213944220|  |C00009985|  |N00030676|     4  03/26/2014   
2   2014  |4063020141216281752|  |C00104299|  |N00032088|  1000  05/06/2014   
3   2014  |4061920141215566782|  |C00164145|  |N00034277|  2500  05/22/2014   
4   2014  |4102420141226480432|  |C00439216|  |N00036023|  1000  09/29/2014   

For some reason the loop is not taking out the '|'.

The following works but I would like to do all of the columns at once.

dfy['cycle'] = dfy['cycle'].map(lambda x: str(x)[:-1])
dfy.head()

    cycle   cid     amount  date    realcode    type    di  feccandid
0   |2014   |N00029285|     1000    05/15/2014  |E1600|     |24K|   |D|     |H8TX22107|
1   |2014   |N00026722|     5000    10/22/2013  |G4600|     |24K|   |D|     |H4TX28046|
2   |2014   |N00030676|     4   03/26/2014  |C2100|     |24Z|   |D|     |H0MO07113|
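
For reference, a rough sketch of what "all of the columns at once" might look like (untested, column names assumed from the output above):

obj_cols = dfy.select_dtypes(include='object').columns
# strip leading/trailing '|' from every string column in one pass
dfy[obj_cols] = dfy[obj_cols].apply(lambda s: s.str.strip('|'))
dfy.head()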

This is what my data looks like when I import it with pd.read_csv and sep=','.

    cycle   cid     amount  date    realcode    type    di  feccandid
0   |2014|  |N00029285|     1000    05/15/2014  |E1600|     |24K|   |D|     |H8TX22107|
1   |2014|  |N00026722|     5000    10/22/2013  |G4600|     |24K|   |D|     |H4TX28046|
2   |2014|  |N00030676|     4   03/26/2014  |C2100|     |24Z|   |D|     |H0MO07113|

This is what it looks like in .txt:

|2014|,|4111920141231643319|,|C00206136|,|N00029285|,1000,05/15/2014,|E1600|,|24K|,|D|,|H8TX22107|
|2014|,|4021120141205164809|,|C00307397|,|N00026722|,5000,10/22/2013,|G4600|,|24K|,|D|,|H4TX28046|
|2014|,|4053020141213944220|,|C00009985|,|N00030676|,4,03/26/2014,|C2100|,|24Z|,|D|,|H0MO07113|
|2014|,|4063020141216281752|,|C00104299|,|N00032088|,1000,05/06/2014,|F1100|,|24K|,|D|,|H0OH06189|
|2014|,|4061920141215566782|,|C00164145|,|N00034277|,2500,05/22/2014,|F3100|,|24K|,|D|,|H2NY22139|

Here is a link to my raw data.

Collective Action
  • You need to call `dfy[['cycle', 'cid']]`. You could find more information in [docs](http://pandas.pydata.org/pandas-docs/stable/indexing.html) – Anton Protopopov Mar 25 '16 at 11:23
  • Thank you for your comment. I updated the code above and am reading through the docs! New error `AttributeError: 'DataFrame' object has no attribute 'map'` – Collective Action Mar 25 '16 at 11:41

2 Answers


Read in the csv and use Series `str` operations like `str.replace`:

import pandas as pd
from cStringIO import StringIO

# Fake csv text for example
textcsv = '''
cycle,cid,amount,date,realcode,type,di,feccandid
|2014|,|N00029285|,1000,05/15/2014,|E1600|,|24K|,|D|,|H8TX22107|
|2014|,|N00026722|,5000,10/22/2013,|G4600|,|24K|,|D|,|H4TX28046|
|2014|,|N00030676|,4   ,03/26/2014,|C2100|,|24Z|,|D|,|H0MO07113|
'''
# Read in fake csv
# normally you would use: dfy = pd.read_csv('/path/to/file.csv')
dfy = pd.read_csv(StringIO(textcsv))

# Replace '|' in cells with series.str methods
for col in dfy:
    if dfy[col].dtype == 'object':
        dfy[col] = dfy[col].str.replace('|', '')

print(dfy)

  cycle        cid  amount        date realcode type di  feccandid
0  2014  N00029285    1000  05/15/2014    E1600  24K  D  H8TX22107
1  2014  N00026722    5000  10/22/2013    G4600  24K  D  H4TX28046
2  2014  N00030676       4  03/26/2014    C2100  24Z  D  H0MO07113
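
One caveat: depending on the pandas version, `str.replace` may treat the pattern as a regular expression, and `|` is a regex metacharacter, so the replacement can silently leave the pipes in place. A rough sketch of two ways to write the loop body instead:

# in older pandas (patterns are regex by default), escape the metacharacter
dfy[col] = dfy[col].str.replace(r'\|', '')
# in newer pandas, pass regex=False to request a plain literal replacement
dfy[col] = dfy[col].str.replace('|', '', regex=False)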
tmthydvnprt
  • Thank you for the comment! I applied the code and got the following error: `ImportError Traceback (most recent call last) in () 1 import pandas as pd ----> 2 from cStringIO import StringIO 3 4 # Fake csv text for example 5 ImportError: No module named 'cStringIO' ` – Collective Action Mar 25 '16 at 12:26
  • if you are using python 3 you should use `io.StringIO` – tmthydvnprt Mar 25 '16 at 12:27
  • `from io import StringIO` – tmthydvnprt Mar 25 '16 at 12:27
  • Thank you again! I made the corrections and also edited the original question to include a link to my raw data. It is now returning an empty df: `Empty DataFrame Columns: [Thesis/CRSP/CampaignFin14/pacs14.txt] Index: [] ` – Collective Action Mar 25 '16 at 12:33
  • I don't see an indication of which line; I ran the code and there are no errors, but when I call `dfy.head()` it returns ` Thesis/CRSP/CampaignFin14/pacs14.txt` – Collective Action Mar 25 '16 at 12:38
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107331/discussion-between-tmthydvnprt-and-michael-perdue). – tmthydvnprt Mar 25 '16 at 12:38

You could preprocess the file in memory to remove all the '|' characters from the lines and pass that to pandas.

import io
import pandas as pd

# build the file contents in memory with every '|' removed
with open('Thesis/CRSP/CampaignFin14/pacs14.txt', 'r') as fi:
    content = ''
    for line in fi:
        content += line.replace('|', '')

# wrap the cleaned text in a buffer and hand it to read_csv
block = io.StringIO(content)
dfy2 = pd.read_csv(block, skipinitialspace=True, delim_whitespace=True)

Here you first create a buffer-like object from the string with io.StringIO() and then pass it to pd.read_csv, which accepts either a filename or a buffer as its first argument.
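
As a rough alternative sketch, the same preprocessing can be done in one step by reading the whole file and replacing on the full string (assuming the file fits in memory, and using the default comma separator since the raw sample above is comma-separated):

with open('Thesis/CRSP/CampaignFin14/pacs14.txt', 'r') as fi:
    content = fi.read().replace('|', '')

dfy2 = pd.read_csv(io.StringIO(content))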

jure
  • Thank you for your suggestion! I applied your solution and received the following error: `---> 10 dfy2 = pd.read_csv(block, skipinitialspace=True, delim_whitespace=True)` `CParserError: Error tokenizing data. C error: Expected 1 fields in line 300866, saw 2` – Collective Action Mar 25 '16 at 11:54
  • You would need to provide all the data for anyone to solve this – tmthydvnprt Mar 25 '16 at 11:56
  • Unless you want to use [`error_bad_lines` or `warn_bad_lines`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to skip these bad lines (options are all the way at the bottom of the page). – tmthydvnprt Mar 25 '16 at 11:57
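
Following up on that last comment, a rough sketch of what skipping the malformed lines could look like (these parameter names come from the pandas documentation of that era; later releases replace them with `on_bad_lines`):

# skip rows that do not parse cleanly and warn about them instead of failing
dfy2 = pd.read_csv(block, skipinitialspace=True, delim_whitespace=True,
                   error_bad_lines=False, warn_bad_lines=True)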