1

I inherited a few hundred CSVs I'd like to import into pandas dataframes. They are formatted like so:

username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281

To pull that into a pandas dataframe, I tried:

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True)

and got this error:

ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11

I assume that's because there's an unescaped quote inside the field

ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...

So, I tried

tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE)

and get a new error (I assume because there are ; in the field):

Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http:// tinyurl.com/n8ozeg5

ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11

I can't regenerate these CSV files. What I'm wondering is, how can I preprocess/fix them so that they are properly formatted (i.e., escape quotes within fields)? Or, is there a way to read them into a dataframe directly even with unescaped quotes?

Libby
  • 581
  • 2
  • 6
  • 21
  • What version of python and pandas are you using? I am getting different results with Python 3.6.1 and pandas 0.19.2 – Raquel Guimarães Jun 15 '17 at 22:24
  • Python 3.5.3 pandas 0.20.2 - what happens for you? – Libby Jun 16 '17 at 00:27
  • For this case, I don't need every column, and add `usecols` fixed my immediate problem. It doesn't answer my actual question though. Here's the line that worked: `tweets = pd.read_csv(file, header=0, sep=';', parse_dates = True, quoting=csv.QUOTE_NONE, usecols=["date","hashtags","permalink"])` – Libby Jun 16 '17 at 01:12

1 Answers1

-1

I would clean the data before reading into pandas. Here is my solution to your current problem.

Edited:
This will replace ; within double quotes (based on this answer)

o = open("fileOut.csv", 'w')
with open("fileIn.txt") as f:
   for lines in f:
      o.write(re.sub('\"[^]]*\"', lambda x:x.group(0).replace(';',''), lines))
o.close()

Original:

o = open("fileOut.csv", 'w')
with open("fileIn.txt") as f:
    for lines in f:
        o.write(lines.replace("; ", ""))
o.close()
ramesh
  • 1,187
  • 7
  • 19
  • 42
  • The ; in tweets isn't always followed by a space, so this works only for some. e.g. `;2013-07-15 15:35;1;0;"@CongressionalPhotoADay 15 - Something beautiful: View from the Speaker's balcony of the United States Capitol;... http:// fb.me/2ZHDzR8XQ";;@CongressionalPhotoADay;;"356874563839201280";https://twitter.com/AustinScottGA08/status/356874563839201280` – Libby Jun 16 '17 at 00:50
  • 1
    @Libby: In that case, use a regex like https://stackoverflow.com/a/11096811/2204131. `re.sub('\"[^]]*\"', lambda x:x.group(0).replace(';','\;'), lines)` will replace `;` within quotes. – ramesh Jun 16 '17 at 17:46