I'm new to Python. I have a CSV file with tweet entries formatted like this:

15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump

and another

16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump

In Python, I load the contents using Pandas like this:

import pandas as pd
data = pd.read_csv(arg, sep=',')

Now I would like to clean the CSV file and save only the user ID (3rd entry on each row) and the tweet itself (I think the 6th entry). As you can see, I split using sep=','. The problem is that some tweets contain commas, and I don't want those characters to be lost in the splitting. If the separator between tweet number, date, user_id, and so on had been something other than a comma, it would have been a lot easier. Any suggestions on how to do this? I just want a new CSV file without the information that I don't need.

  • Possible duplicate of [Dealing with commas in a CSV file](http://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file) – Priyank Mar 11 '17 at 16:55
  • Thanks Priyank, but I would like to know if there is a way of dealing with this in Python as well. I think C# would be easy in this case, but I want to learn everything in Python too. – Great Cubicuboctahedron Mar 11 '17 at 16:56
  • Pandas won't split on the `,` because it's between `"` and the `"`'s within the `"`s are escaped anyway... so, not quite sure what your concern is here... – Jon Clements Mar 11 '17 at 16:59
  • @JonClements Wow, is that true? In that case my question is really stupid. I realize now that you are right. Not all rows have ""s – Great Cubicuboctahedron Mar 11 '17 at 17:01
  • Just look at the data loaded... you'll see it's fine... The quotes are only required if the field delimiter appears within a column... – Jon Clements Mar 11 '17 at 17:02
  • @JonClements you are right, any ideas on how to throw away the parts that I don't want? Can I access by index? – Great Cubicuboctahedron Mar 11 '17 at 17:03
  • @GreatCubicuboctahedron start [here](http://pandas.pydata.org/pandas-docs/stable/10min.html) – Jon Clements Mar 11 '17 at 17:07
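
Following up on the comments, here is a minimal sketch of how the unwanted columns could be dropped by position with pandas. It assumes the file has no header row (as in the two sample rows above); `raw` and `keep_id_and_text` are illustrative names, not from the original post, and an in-memory buffer stands in for the real file.

```python
import io

import pandas as pd

# The two sample rows from the question, standing in for the real file.
raw = '''15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump
16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump
'''


def keep_id_and_text(source):
    """Load only the 3rd and 6th fields (positions 2 and 5)."""
    # header=None: assumption that the file has no header line.
    # usecols=[2, 5]: keep the ID and the tweet text, drop the rest.
    # dtype=str: keep the long numeric ID as text.
    return pd.read_csv(source, header=None, usecols=[2, 5], dtype=str)


df = keep_id_and_text(io.StringIO(raw))

# Write the reduced table; pass a file path instead of the buffer to save to disk.
out = io.StringIO()
df.to_csv(out, index=False, header=False)
print(out.getvalue())
```

The commas inside the quoted tweet text stay part of the field, and `to_csv` quotes those fields again on the way out.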

1 Answer

The problem is that some tweets contain commas, and I don't want those characters to be lost in the splitting.

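Python's standard-library csv module handles this case well. A field wrapped in double quotes may contain commas, and a doubled "" inside such a field is read as a single literal quote: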

>>> import csv
>>> s = '''15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump
16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump
'''.splitlines()
>>> for fields in csv.reader(s):
        print(fields[2], fields[5])


785816454042124288 Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!
785563318652178432 Wow, @CNN got caught fixing their "focus group" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!
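
To get the new CSV file the question asks for, the same loop can feed a csv.writer instead of print. This is only a sketch: `write_id_and_text` is an illustrative name, and in-memory buffers stand in for real files. With real files, open both with `newline=''`, as the csv module documentation recommends.

```python
import csv
import io

# The two sample rows from the question, standing in for the input file.
sample = '''15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump
16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump
'''


def write_id_and_text(src, dst):
    """Copy only the 3rd and 6th fields of each row from src to dst."""
    writer = csv.writer(dst)
    for fields in csv.reader(src):
        # The writer re-quotes fields that contain commas or quotes.
        writer.writerow([fields[2], fields[5]])


out = io.StringIO()
write_id_and_text(io.StringIO(sample), out)
print(out.getvalue())
```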
Raymond Hettinger