I currently have a CSV file of Twitter users that I am trying to convert to a Python dataframe. This file contains a user's id, background image, and bio description, as well as other information about the user profile. The problem is that I need to get rid of the commas in the bio description, because they interfere with the comma separation of the file and create extra columns in the dataframe.
All of the fields on a line are enclosed in parentheses, and each text subfield is enclosed in quotes. Most of the time these are single quotes, but the bio description sometimes has double quotes. See below for an example of a line from this file:
(19435878, u'http://a3.xxx.com/profile_background_images.jpg', 1232785000000L, u'I have been researching a British Spies life for a few years. My site tells his story. His name's ffrench, Conrad ffrench. ', 0, 753, 837, 0, u'Lincolnshire', u'Joe Bloggs', u'http://a0.xxx.com/profile_images.jpg', 0, u'00bloggs', 10, u'London', 0, u'en', 3, u'http://sample.com')
What I am trying to do is find the commas in the bio description using a regex. It works fine when the description is in double quotes, but when I edit the regex for single quotes it does not work if the description contains an apostrophe, as in a contraction like "I'm".
Here is my regex for the double-quote case, which finds the commas inside double quotes (which I found here):
,(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)
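For context, a minimal sketch of how this pattern behaves with Python's `re` module. The sample line and the variable names are made up for illustration; only the pattern itself comes from above.

```python
import re

# Pattern from above: a comma that is followed by an odd number of
# double quotes before the end of the line, i.e. a comma that sits
# inside a double-quoted field.
COMMA_IN_DOUBLE_QUOTES = re.compile(r',(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)')

# Hypothetical line whose description uses double quotes.
line = '(1, u"Hello, world, again", 0, u\'London\')'

# Remove only the commas inside the double-quoted description.
cleaned = COMMA_IN_DOUBLE_QUOTES.sub("", line)
print(cleaned)
```

The separators between fields are followed by an even number of double quotes, so the lookahead rejects them and only the commas inside the description are removed.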
What seems to be happening is that the apostrophe in "name's" throws off the single-quote count, so the regex picks up commas outside of the quotes and removes the separators, and some rows end up with fewer columns.
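The failure can be reproduced with a short sketch. It assumes the single-quote version is just the pattern above with every `"` swapped for `'`, and it uses a shortened, made-up line rather than the real data.

```python
import re

# Assumed single-quote variant: the same pattern with " replaced by '.
COMMA_IN_SINGLE_QUOTES = re.compile(r",(?=[^']*'(?:[^']*'[^']*')*[^']*$)")

# Hypothetical shortened line with an apostrophe inside the description.
line = "(1, u'His name's ffrench, Conrad', 0, u'London')"

# The apostrophe makes the quote count odd for every comma before it,
# so the separator after the first field is removed as well.
broken = COMMA_IN_SINGLE_QUOTES.sub("", line)
print(broken)
```

Here the comma inside the description is removed as intended, but so is the separator after the first field, which is why rows lose columns.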
I am trying to come up with a regex that could maybe use u' as its starting point and maybe a ', as the end point and only search between them, but I can't seem to get anywhere close. I'm a newbie to this regex stuff and I'm finding it hard to wrap my head around some of the more complex forms of it.
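To make the idea concrete, here is a rough sketch of one way it could look in Python. It is not a definitive solution: the names `FIELD`, `strip_commas` and `SAMPLE` are hypothetical, and it assumes the closing quote of a text field is always followed either by `, ` plus the start of the next field (a digit or another `u'` / `u"` string) or by the final `)`. A description that itself contains something like `', 5` would still break it.

```python
import re

# u' ... ' where the closing quote is followed by ", " and the start of
# the next field (a digit or another u'/u" string), or by the final ")".
FIELD = re.compile(r"""u'(.*?)'(?=, (?:u['"]|\d)|\)$)""")

def strip_commas(line):
    """Remove commas found inside u'...' fields; keep the separators."""
    return FIELD.sub(
        lambda m: "u'" + m.group(1).replace(",", "") + "'", line
    )

# The example line from above, split across several literals for width.
SAMPLE = (
    "(19435878, u'http://a3.xxx.com/profile_background_images.jpg', 1232785000000L, "
    "u'I have been researching a British Spies life for a few years. "
    "My site tells his story. His name's ffrench, Conrad ffrench. ', "
    "0, 753, 837, 0, u'Lincolnshire', u'Joe Bloggs', "
    "u'http://a0.xxx.com/profile_images.jpg', 0, u'00bloggs', 10, u'London', "
    "0, u'en', 3, u'http://sample.com')"
)

cleaned = strip_commas(SAMPLE)

# With the inner commas gone, a plain split on ", " recovers the fields.
fields = cleaned.strip("()").split(", ")
print(len(fields))
```

The lazy `.*?` stops at the first quote that looks like the end of a field, so the apostrophe in "name's" is skipped because it is not followed by a separator.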
Any help is greatly appreciated.