Cleaning dataset in Python

Question

I'm new to Python. I have a CSV file with tweet entries formatted like this:

15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump

and another

16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump

In Python, I load the contents using Pandas like this:

data = pd.read_csv(arg, sep=',')

Now I would like to clean the CSV file and save only the user ID (the 3rd entry on each row) and the tweet itself (I think the 6th entry). As you can see, I split using sep=','. The problem is that some tweets contain commas, and I don't want those commas to be removed by the splitting. If the separator between tweet number, date, user_id, and so on had been something other than a comma, it would have been a lot easier. Any suggestions on how to do this? I just want a new CSV file without the information that I don't need.

Possible duplicate [Dealing with commas in a CSV file](http://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file) — Priyank, Mar 11 '17 at 16:55
Thanks Priyank, but I would like to know if there is a way of dealing with this in Python as well. I think C# would be easy in this case, but I want to learn everything in Python too. — Great Cubicuboctahedron, Mar 11 '17 at 16:56
Pandas won't split on the `,` because it's between `"` and the `"`'s within the `"`s are escaped anyway... so, not quite sure what your concern is here... — Jon Clements, Mar 11 '17 at 16:59
@JonClements wow is that true? In that case my question is really stupid. I realize now that you are right. Not all rows have `"`s — Great Cubicuboctahedron, Mar 11 '17 at 17:01
Just look at the data loaded... you'll see it's fine... The quotes are only required if the field delimiter appears within a column... — Jon Clements, Mar 11 '17 at 17:02
@JonClements you are right, any ideas on how to throw away the parts that I don't want? Can I access by index? — Great Cubicuboctahedron, Mar 11 '17 at 17:03
@GreatCubicuboctahedron start [here](http://pandas.pydata.org/pandas-docs/stable/10min.html) — Jon Clements, Mar 11 '17 at 17:07

score 0 · Answer 1 · answered Mar 11 '17 at 17:55

> The problem is that some tweets contain commas, and I don't want those commas to be removed by the splitting.

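Python's standard-library csv module handles this case well: it treats commas inside a double-quoted field as part of the field, and it turns doubled quotes ("") inside such a field back into a single quote: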

>>> import csv
>>> s = '''15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump
16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump
'''.splitlines()
>>> for fields in csv.reader(s):
        print(fields[2], fields[5])


785816454042124288 Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!
785563318652178432 Wow, @CNN got caught fixing their "focus group" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!
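
To get the new CSV file the question asks for, the same reader can be paired with csv.writer, which re-quotes any field that contains a comma or a quote. The following is only a sketch: the helper name keep_id_and_text is made up for illustration, the column positions 2 and 5 are taken from the sample rows above, and it assumes the file has no header row.

```python
import csv
import io


def keep_id_and_text(src, dst, id_col=2, text_col=5):
    """Copy only the ID and tweet-text columns from src to dst.

    Hypothetical helper; column positions are assumed from the sample rows.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for fields in reader:
        if not fields:
            # csv.reader yields an empty list for a blank line; skip it.
            continue
        writer.writerow([fields[id_col], fields[text_col]])


# One row from the question, with commas and doubled quotes inside the tweet.
sample = (
    '16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,'
    '"Wow, @CNN got caught fixing their ""focus group"" in order to make '
    'Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump\n'
)

# In-memory streams stand in for real files here.
src = io.StringIO(sample, newline='')
dst = io.StringIO(newline='')
keep_id_and_text(src, dst)
cleaned = dst.getvalue()
print(cleaned)
```

With real files, the two streams would be opened with open(path, newline='') for reading and open(path, 'w', newline='') for writing, as the csv module documentation recommends.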
