
I currently have a CSV file of Twitter users that I am trying to convert to a pandas DataFrame. This file contains each user's id, background image, a bio description, and other information about the user profile. The problem is that I am trying to get rid of the commas in the bio description, because they interfere with the comma separation of the file and create extra columns in the DataFrame.

All of the fields in a line are enclosed in parentheses and each subfield is enclosed in quotes. Most of the time these are single quotes, but the bio description sometimes has double quotes. See below for an example of a line from this file:

(19435878, u'http://a3.xxx.com/profile_background_images.jpg', 1232785000000L, u'I have been researching a British Spies life for a few years. My site tells his story. His name's ffrench, Conrad ffrench. ', 0, 753, 837, 0, u'Lincolnshire', u'Joe Bloggs', u'http://a0.xxx.com/profile_images.jpg', 0, u'00bloggs', 10, u'London', 0, u'en', 3, u'http://sample.com')

What I am trying to do is find the commas in the bio description using a regex. It works fine when the description is in double quotes, but when I edit the regex for single quotes it does not work if the text contains a contraction like "I'm".

Here is my regex for the double-quote case, which finds the commas inside double quotes (which I found here):

,(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$) 

What seems to be happening is that the apostrophe in "name's" throws off the single-quote count, so the regex picks up commas outside of the quotes and removes the separators, and some rows end up with fewer columns.

I am trying to come up with a regex that could use u' as its starting point and ', as the end point and only search between them, but I can't seem to get anywhere close. I'm a newbie to regex and I'm finding it hard to wrap my head around some of its more complex forms.

Any help is greatly appreciated.

eeno

1 Answer


I'm not sure the commas are the problem; it looks more like the apostrophe inside a string that itself starts and ends with a single quote. You mention double quotes can appear in there too, so I guess you cannot simply change the strings to be enclosed in double quotes and hope that fixes the issue.

One option is to change the sep argument passed to pd.read_csv to something custom. The example code below uses the only line given in the question to create a 15 column frame.

Note: no desired output is shown in the question at the moment, so I am guessing the below is correct.
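
To illustrate the point about the apostrophe, here is a small sketch. It assumes the single-quote version of the question's regex is simply the double-quote pattern with every " swapped for ' (the asker's exact edit is not shown), and it uses two made-up rows rather than real data:

```python
import re

# Assumed single-quote variant of the question's pattern: it matches a comma
# only when an odd number of ' characters follows it on the line.
COMMA_IN_QUOTES = re.compile(r",(?=[^']*'(?:[^']*'[^']*')*[^']*$)")

# Hypothetical rows: the second one has an apostrophe inside the text field.
row_ok = "(1, u'a, b', 2)"
row_apostrophe = "(1, u'it's a, b', 2)"

cleaned_ok = COMMA_IN_QUOTES.sub("", row_ok)
cleaned_apostrophe = COMMA_IN_QUOTES.sub("", row_apostrophe)

print(cleaned_ok)          # only the comma inside the quotes is removed
print(cleaned_apostrophe)  # the separator after the first field is removed too
```

In the second row the apostrophe makes the quote count odd for the comma right after the first field, so the pattern treats that separator as if it were inside a string.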

Code:

import re

with open('so_69080377.txt', 'r') as f:
    lines = f.readlines()
    
new_lines = []
for line in lines:
    print(line,'#<<- one line in text file\n')
    # see https://regex101.com/r/vwOegA/1 for regex example
    new_lines.append(re.sub(r"(,) u'|('),\s", r',,, ', line))

s = ''.join(new_lines)

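from io import StringIO
import pandas as pd

# custom 'sep=' that looks for the replacement made by the re.sub call above;
# in this case three commas together (if that's not unique, change the replacement in re.sub)
# a separator longer than one character is treated as a regex, so use the Python parsing engine
# display() is available in Jupyter/IPython; use print() in a plain script
display(pd.read_csv(StringIO(s), sep=',,,', header=None, engine='python'))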

Output:

(19435878, u'http://a3.xxx.com/profile_background_images.jpg', 1232785000000L, u'I have been researching a British Spies life for a few years. My site tells his story. His name's ffrench, Conrad ffrench. ', 0, 753, 837, 0, u'Lincolnshire', u'Joe Bloggs', u'http://a0.xxx.com/profile_images.jpg', 0, u'00bloggs', 10, u'London', 0, u'en', 3, u'http://sample.com') #<<- one line in text file
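
A different route, closer to what the question asks for (start at u' and stop at a quote followed by a comma), is to pull the fields out directly instead of replacing separators. The following is only a sketch: the FIELD pattern and the parse_line helper are hypothetical names, not part of the answer above, and the pattern assumes that a text field always closes with its quote right before ", " or the final ")". A bio that itself contains a quote followed by a comma and a space would still be split early.

```python
import re

# Hypothetical field pattern (not from the original answer).
FIELD = re.compile(
    r"""
    u(?P<quote>['"])(?P<text>.*?)(?P=quote)(?=,\s|\)\s*$)  # quoted text, closed just before a separator or the last parenthesis
    |
    (?P<number>-?\d+)L?                                     # integer, with an optional Python 2 long suffix
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Return the fields of one row as a list of int and str values."""
    fields = []
    for match in FIELD.finditer(line.strip()):
        if match.group("number") is not None:
            fields.append(int(match.group("number")))
        else:
            fields.append(match.group("text"))
    return fields


# The example line from the question, split over several literals for readability.
sample = (
    "(19435878, u'http://a3.xxx.com/profile_background_images.jpg', 1232785000000L, "
    "u'I have been researching a British Spies life for a few years. "
    "My site tells his story. His name's ffrench, Conrad ffrench. ', "
    "0, 753, 837, 0, u'Lincolnshire', u'Joe Bloggs', "
    "u'http://a0.xxx.com/profile_images.jpg', 0, u'00bloggs', 10, u'London', "
    "0, u'en', 3, u'http://sample.com')"
)

fields = parse_line(sample)
print(len(fields))
print(fields[3])
```

On this line the sketch keeps the bio, including its comma and apostrophe, as a single field and returns the numbers as separate fields. A list of such rows could then be handed to pd.DataFrame(rows).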


MDR
  • Thanks for your reply MDR. Great idea on changing the separators, don't know why that didn't occur to me. Yep, you were correct in assuming the output. I'm currently using your solution above to try and get what I need, but the file is way messier than I thought. I've used " \t " as the separator as I was getting errors about multi-char separators. – eeno Sep 07 '21 at 21:15
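
The single-character separator mentioned in the comment can be sketched as follows. This is an illustration under assumptions: the row is made up, a bare tab (without the surrounding spaces from the comment) is used as the replacement, and the standard-library csv module stands in for pandas so the snippet runs without extra dependencies.

```python
import csv
import re
from io import StringIO

# Hypothetical row in the same shape as the question's data.
line = "(1, u'Joe's bio, with a comma', 0, 753, u'London')"

# Same substitution as in the answer, but with a single tab as the replacement.
tabbed = re.sub(r"(,) u'|('),\s", "\t", line)

# QUOTE_NONE keeps apostrophes and quotes as ordinary characters.
reader = csv.reader(StringIO(tabbed), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[0])
```

With pandas, the same text can be read with pd.read_csv(StringIO(tabbed), sep="\t", header=None, quoting=csv.QUOTE_NONE). Note that this substitution leaves the outer parentheses in place and keeps runs of adjacent numbers such as "0, 753" in one column, so further cleanup is still needed.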