
I've been downloading tweets as a .csv file with the following schema: username;date;retweets;favorites;text;geo;mentions;hashtags;permalink

The problem is that some tweets have semicolons in their text attribute, for example, "I love you babe ;)"

When I try to import this CSV into R, I get some records with the wrong schema, as you can see here: imported csv with read.csv

I think this format error happens because the CSV parser finds a ; in the text section and splits the row there, if you understand what I mean.

I've already tried matching with the regex: (;".*)(;)(.*";) and replacing it with ($1)($3) until no more matches are found, but the CSV parsing error persists.

Any ideas on how to clean this CSV file? Or why the CSV parser is misbehaving?

Thanks for reading

Edit1: I don't think there is any problem with the structure beyond a badly chosen separator (';'). Look at this example record:

Juan_Levas;2015-09-14 19:59;0;2;"Me sonrieron sus ojos; y me tembló hasta el alma.";Medellín,Colombia;;;https://twitter.com/Juan_Levas/status/643574711314710528

This is a well-formatted record, but I think the semicolon in the text section (delimited by "") forces the parser to split the text section into 2 columns, in this case: `"Me sonrieron sus ojos` and ` y me tembló hasta el alma."`. Is this possible?

Also, I'm using read.csv("data.csv", sep=';') to parse the CSV into a data frame.

Edit2: How to reproduce the error:

  1. Get the csv from here [~2 MB]: Download csv
  2. Do df <- read.csv('twit_data.csv', sep=';')
  3. Explore the resulting data frame (you can sort it by date, retweets or favorites and you will see the inconsistencies in the parsing)
Vichoko
  • Which function do you use? `read.csv2`? Could you provide a sample of your CSV? – Scarabee Oct 13 '16 at 20:33
  • How did you get a bad CSV file in the first place? It's not easy to read poorly formatted input files. A more [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would be helpful (pictures of data aren't particularly useful). – MrFlick Oct 13 '16 at 20:38
  • I answered your questions as an 'edit' of the main question. Meanwhile I'll make a reproducible example so you can better understand what's happening. – Vichoko Oct 13 '16 at 22:04

1 Answer


Your CSV file is not properly formatted. The problem is not that the separator occurs inside character fields; it is that the `"` characters inside those fields are not escaped.
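To illustrate the first half of that claim, here is a minimal base R sketch that parses the sample record from the question with `read.csv(..., sep = ";")`. The header line is taken from the schema in the question, and passing the lines through the `text` argument instead of a file is only for illustration. Under those assumptions, the quoted semicolon should stay inside the text field:

```r
# Header (schema from the question) plus the sample record from Edit1.
lines <- c(
  "username;date;retweets;favorites;text;geo;mentions;hashtags;permalink",
  "Juan_Levas;2015-09-14 19:59;0;2;\"Me sonrieron sus ojos; y me tembló hasta el alma.\";Medellín,Colombia;;;https://twitter.com/Juan_Levas/status/643574711314710528"
)

# read.csv treats " as the quote character by default, so separators
# inside a quoted field are not used to split the row.
df <- read.csv(text = lines, sep = ";", stringsAsFactors = FALSE)

ncol(df)   # 9 columns, matching the schema
df$text    # the whole tweet, semicolon included
```

So a semicolon inside a properly quoted field is not a problem by itself; it is a stray, unescaped `"` inside a field that confuses the parser about where the field ends.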

The best thing to do would be to generate a new file in a proper format (typically following RFC 4180).
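If you control the export step and can write the file from R, a rough sketch of what "proper format" means could look like this. The `tweets` data frame and its column names are hypothetical stand-ins, and the `;` separator is kept only to match the question (RFC 4180 itself describes comma-separated files). The relevant part is `qmethod = "double"`, which writes an embedded `"` as `""`:

```r
# Hypothetical data standing in for the tweets; names are illustrative only.
tweets <- data.frame(
  username = c("user_a", "user_b"),
  text     = c("I love you babe ;)", "He said \"hi\"; then left"),
  stringsAsFactors = FALSE
)

out <- tempfile(fileext = ".csv")

# quote = TRUE wraps character fields in double quotes;
# qmethod = "double" escapes an embedded " by doubling it.
write.table(tweets, out, sep = ";", quote = TRUE, qmethod = "double",
            row.names = FALSE)

# Reading the file back should give the original text, quotes and all.
back <- read.csv(out, sep = ";", stringsAsFactors = FALSE)
identical(back$text, tweets$text)
```

Note that `write.table` defaults to `qmethod = "escape"` (backslash escaping), so the argument has to be set explicitly; `write.csv2` uses the doubling convention on its own.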

If it's not possible, your best option is to use a "smart" tool like readr:

library(readr)
df <- read_csv2('twit_data.csv')

It does quite a good job on your file. (I can't see any obvious parsing error in the resulting data frame)
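If you want a quick sanity check that does not depend on eyeballing the data frame, one option in base R is to count the fields on each line with `count.fields`. The sketch below runs on a small made-up file (the contents are hypothetical, not taken from your data) and flags lines whose field count differs from the header:

```r
# A small hypothetical file: three clean lines and one with bare quotes
# inside a quoted field.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "username;text;hashtags",
  "user_a;\"I love you babe ;)\";#love",
  "user_b;\"don't stop\";#music",
  "user_c;\"she said \"hi; there\" now\";#oops"
), path)

# count.fields defaults to quote = "\"'" and comment.char = "#", which would
# misread apostrophes and hashtags in tweets, so both are overridden here.
n_fields <- count.fields(path, sep = ";", quote = "\"", comment.char = "")

# Lines whose count is NA or differs from the header are worth inspecting.
suspicious <- which(is.na(n_fields) | n_fields != n_fields[1])
suspicious
```

On the real file you would pass `'twit_data.csv'` instead of the temporary path.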

Scarabee