I inherited a few hundred CSVs I'd like to import into pandas dataframes. They are formatted like so:
username;date;retweets;favorites;text;geo;mentions;hashtags;id;permalink
;2011-03-02 11:04;0;0;"ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...";;;;"42993734165594112";https://twitter.com/AustinScottGA08/status/42993734165594112
;2014-02-25 10:38;3;0;"Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable.";;;#IRS;"438352361812426752";https://twitter.com/AnderCrenshaw/status/438352361812426752
;2017-06-14 12:39;4;6;"Thank you to the brave men and women who have answered the call to defend our great nation. Happy 242nd Birthday @USArmy ! #ArmyBDay pic.twitter.com/brBYCOLBJZ";;@USArmy;#ArmyBDay;"875045042758369281";https://twitter.com/AustinScottGA08/status/875045042758369281
To pull that into a pandas dataframe, I tried:
import pandas as pd
tweets = pd.read_csv(file, header=0, sep=';', parse_dates=True)
and got this error:
ParserError: Error tokenizing data. C error: Expected 10 fields in line 1, saw 11
I assume that's because there's an unescaped quote inside the text field of the first data row:
ICYMI: "What you have is 87 people who have common goals of working for [the] next generation; that’s why our...
So I tried disabling quote handling:
import csv
tweets = pd.read_csv(file, header=0, sep=';', parse_dates=True, quoting=csv.QUOTE_NONE)
and got a new error:
ParserError: Error tokenizing data. C error: Expected 10 fields in line 2, saw 11
I assume that's because, with quoting turned off, the ; inside this text field is treated as a delimiter:
Will be asking tough questions of #IRS at 2/26 FSGG hearing; supporting bills to make agency more accountable. http:// tinyurl.com/n8ozeg5
I can't regenerate these CSV files. How can I preprocess them so that they are properly formatted (i.e., with quotes inside fields escaped)? Or is there a way to read them into a dataframe directly, even with the unescaped quotes?
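For illustration, here is a minimal sketch of one possible preprocessing approach. It is not a definitive fix and rests on assumptions that go beyond the sample above: every record sits on a single line, only the text field (the 5th of 10) can contain stray ; or " characters, and the five fields after it never contain a semicolon. Under those assumptions, each line can be split four times from the left and five times from the right, and whatever remains in the middle is the text. The names split_row, repair, N_LEFT and N_RIGHT are hypothetical.

```python
import csv
import io

N_LEFT = 4   # username, date, retweets, favorites come before the text field
N_RIGHT = 5  # geo, mentions, hashtags, id, permalink come after it


def split_row(line):
    """Split one raw line into 10 fields.

    Assumption: only the text field may contain ';' or '"', so everything
    between the 4th semicolon from the left and the 5th from the right
    is taken to be the text.
    """
    left = line.rstrip("\r\n").split(";", N_LEFT)
    if len(left) != N_LEFT + 1:
        raise ValueError(f"too few fields: {line!r}")
    right = left[-1].rsplit(";", N_RIGHT)
    if len(right) != N_RIGHT + 1:
        raise ValueError(f"too few fields: {line!r}")
    text, tail = right[0], right[1:]
    # Drop the single pair of quotes wrapping the text; inner quotes stay as data.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    # The id field is wrapped in quotes as well.
    tail = [field.strip('"') for field in tail]
    return left[:N_LEFT] + [text] + tail


def repair(src, dst):
    """Read raw lines from src and write properly quoted CSV to dst."""
    writer = csv.writer(dst, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    for line in src:
        if line.strip():  # skip blank lines
            writer.writerow(split_row(line))
```

Both arguments are file objects, with the output opened using newline="" as the csv module recommends. The writer doubles any quote inside a field, so the repaired file should load with the original pd.read_csv(file, header=0, sep=';') call and no quoting argument. Lines that do not fit the assumptions raise ValueError instead of being silently mangled, so they can be inspected by hand.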