
I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions, but the number of files involved is usually quite small. I hope you can help me write R code that does this job both efficiently and effectively.

The CSV files have the following format:

[Image: example of the CSV format]

In columns 2 and 3, I changed the Twitter usernames to A-E and the 'actual names' to A1-E1.

Raw text file:

"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 @A (A1):  Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 @C (C1):  Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 @D (D1):  LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 @E (E1):  Ik kijk Bureau sport op Nederland 3. #bureausport  #kijkes";"E (E1)";"2012-06-05 00:00:27"

Somehow my headers are messed up: they obviously should move one column to the right. Each CSV file contains up to 1,500 tweets. I would like to remove the duplicates by checking the 2nd column (containing the tweets), simply because the tweets should be unique, while the author column can repeat (e.g. one author posting multiple tweets).

Is it possible to combine merging the files and removing the duplicates, or is that asking for trouble and should the two processes be separated? As a starting point I included links to two blog posts from Hayward Godwin that discuss three approaches for merging CSV files.

http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/

http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/

Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R), but I haven't found anything that discusses both merging and removing the duplicates. I really hope you can help me, with my limited R knowledge, deal with this challenge!

Although I have tried some code I found on the web, it didn't actually result in an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):

filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))              

This results in the following error:

Error in file(file, "rt") : cannot open the connection 
In addition: Warning message: 
In file(file, "rt") : 
  cannot open file '..': No such file or directory 
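
As Andrie pointed out in the comments, the likely cause is that list.files() returns bare file names without their directory, so read.csv() cannot find them unless the working directory happens to be "~/". A minimal sketch of a fix (assuming the files really do live under ~/):

# ask list.files() for full paths so read.csv() can find the files
# regardless of the current working directory
filenames <- list.files(path = "~/", pattern = "\\.csv$", full.names = TRUE)
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))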

Update

I have tried the following code:

# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) {
  read.csv(fnam, header=FALSE, skip=1, sep=';',
           col.names=c('ID','tweet','author','local.time'),
           colClasses=rep('character', 4))
}
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]

But I run into the following errors:

After the 3rd line I get:

  Error in read.table(file = file, header = header, sep = sep, quote = quote,  :  more columns than column names

After the 4th line I get:

  Error: object 'my.df' not found

I suspect that these errors are caused by failures in the writing process of the CSV files, since in some cases the author/local.time values are in the wrong column, either to the left or the right of where they are supposed to be, which results in an extra column. I manually fixed 5 files and tested the code on them without getting any errors. However, it seemed like nothing happened at all: I didn't get any output from R.
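
Note that an assignment in R prints nothing, so "no output" after the last line is actually the expected behaviour when everything works. To see the result, the objects have to be inspected explicitly, for example:

# assignments are silent in R; inspect the merged result explicitly
nrow(my.df)      # total number of rows read from all files
str(my.df)       # structure and column types of the merged data.frame
head(my.new.df)  # first few rows after removing duplicate tweets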

To solve the extra column problem I adjusted the code slightly:

# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) {
  read.csv(fnam, header=FALSE, skip=1, sep=';',
           col.names=c('ID','tweet','author','local.time','extra'),
           colClasses=rep('character', 5))
}
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]

I tried this code on all the files; although R clearly started processing, I eventually got the following errors:

 Error in read.table(file = file, header = header, sep = sep, quote = quote,  : more columns than column names
 In addition: Warning messages:
 1: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_22_30 2012 .csv'
 2: In read.table(file = file, header = header, sep = sep, quote = quote,  : incomplete final line found by readTableHeader on 'Twitts -  di mei 29 19_24_31 2012 .csv'

 Error: object 'my.df' not found

What did I do wrong?

Gert
  • Show some of the code that you are using. It's possible that you are sending the wrong `header` argument to your `read.csv()`. – Andrie Jun 09 '12 at 13:45
  • Your question is clear enough, but it's not clear what you have done so far and why it doesn't work. Show the `read.csv()` call that you use to read a file. Then we can comment on what you are doing wrong. – Andrie Jun 09 '12 at 14:12
  • I edited my question, hope this is what you're after? – Gert Jun 09 '12 at 14:13
  • Does `filename` contain the correct list of files that you want to import? This code clearly falls over on one of the `read.csv` statements. It could be that you need to change the `list.files()` to return the full path. What is your working directory? – Andrie Jun 09 '12 at 14:23
  • The working directory is the folder that contains all the CSV files, so list.files() should 'load' the CSV files I'm after. As for the filename part, that is specified by the files loaded by list.files, isn't it? – Gert Jun 09 '12 at 14:28
  • You say `list.files` "should" get the correct files. Does it? Did you check? If your working directory is different from `~/` then it won't work. – Andrie Jun 09 '12 at 14:29
  • I just checked it, and it does get the correct files! I tried the code again and now I got the response: 'NULL'. – Gert Jun 09 '12 at 14:31

1 Answer


First, simplify matters by working in the folder where the files are, and set the pattern so that only files ending in '.csv' are read, so something like

filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))

This should get you a data.frame with the contents of all the tweets.

A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I'd handle those something like this:

read.csv('fred.csv', header=FALSE, skip=1, sep=';',
    col.names=c('ID','tweet','author','local.time'),
    colClasses=rep('character', 4))

NB: changed so that all columns are read as character, and the separator is ';'.

I'd parse out the time later if it was needed...
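
If you do need it, a minimal sketch of that later step (assuming local.time always has the format shown in the sample above):

# convert the character timestamps to POSIXct once the data is clean
my.df$local.time <- as.POSIXct(my.df$local.time, format = "%Y-%m-%d %H:%M:%S")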

A further separate issue is the uniqueness of the tweets within the data.frame - but I'm not clear whether you want them to be unique per user or globally unique. For globally unique tweets, something like

my.new.df <- my.df[!duplicated(my.df$tweet),]

For unique by author, I'd append the two fields - hard to know what works without the real data though!

my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
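
As an aside, duplicated() also works directly on a data.frame, which sidesteps the (unlikely) chance of two different tweet/author pairs pasting to the same string:

# consider the tweet and author columns jointly when testing for duplicates
my.new.df <- my.df[!duplicated(my.df[, c('tweet', 'author')]), ]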

So bringing it all together and assuming a few things along the way...

# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
    col.names=c('ID','tweet','author','local.time'),
    colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]

Based on the revised warnings after line 3, it's a problem with files that have different numbers of columns. This is not easy to fix in general, except, as you have suggested, by allowing an extra column in the specification. If you remove the specification entirely, you will run into problems when you try to rbind() the data.frames together...
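
Before fixing anything, it can help to find out which files actually have too many fields. A sketch using base R's count.fields() (skip = 1 skips the misaligned header row; the sep and quote values are assumed to match the sample shown in the question):

filenames <- list.files(path = ".", pattern='^.*\\.csv$')
for (fnam in filenames) {
  # count the ';'-separated fields on each data line of the file
  n.fields <- count.fields(fnam, sep = ';', quote = '"', skip = 1)
  # flag files where any row deviates from the expected 4 fields
  if (any(n.fields != 4, na.rm = TRUE)) {
    cat(fnam, 'has rows with', paste(sort(unique(n.fields)), collapse = '/'),
        'fields\n')
  }
}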

Here is some code using a for() loop and some debugging cat() statements to make more explicit which files are broken so that you can fix things:

filenames <- list.files(path = ".", pattern='^.*\\.csv$')

if (exists('my.df')) rm(my.df) # clear any stale result from a previous run
n.files.processed <- 0 # how many files did we process?
for (fnam in filenames) {
  cat('about to read from file:', fnam, '\n')
  if (exists('tmp.df')) rm(tmp.df)
  tmp.df <- read.csv(fnam, header=FALSE, skip=1, sep=';',
             col.names=c('ID','tweet','author','local.time','extra'),
             colClasses=rep('character', 5))
  if (exists('tmp.df') && (nrow(tmp.df) > 0)) {
    cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
    n.files.processed <- n.files.processed + 1 # count this file as processed
    # now let's append a column containing the originating file name
    # so that debugging the file contents is easier
    tmp.df$fnam <- fnam

    # now let's rbind everything together
    if (exists('my.df')) {
      my.df <- rbind(my.df, tmp.df)
    } else {
      my.df <- tmp.df
    }
  } else {
    cat('  read NO rows from ', fnam, '\n')
  }
}
cat('processed ', n.files.processed, ' files\n')
my.new.df <- my.df[!duplicated(my.df$tweet),]
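
Finally, once my.new.df looks right, writing the merged 'master' file back out is one call to write.table(). A sketch (the output file name is just an example; sep=';' matches the input format):

# write the de-duplicated master file; row.names=FALSE avoids an extra
# index column, and qmethod='double' doubles embedded quotes as in the input
write.table(my.new.df, file='master_tweets.csv', sep=';',
            row.names=FALSE, qmethod='double')
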
Sean
  • Thanks Sean, will give this a go tomorrow! There are only .csv files in the folder, so the pattern part seems to be unnecessary. – Gert Jun 09 '12 at 17:03
  • I had some time to spare, so I decided to test your suggestion, Sean. I got the following error after trying the first part of the code: Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names – Gert Jun 09 '12 at 18:35
  • Hi there, could you post the first few lines of one of your csv files (assuming that's ok) and indicate if all of them have the same format? – Tim P Jun 10 '12 at 00:45
  • Tim, I have edited my question and included an image as an example of my csv files. I chose an image because simply copy-pasting ruined the layout of the question. All CSV files have the same format; the number of tweets varies, with a max. of 1500 per csv file. – Gert Jun 10 '12 at 09:20
  • Seems like the header of your CSV files wouldn't fit the columns. Can you check that? – fotNelton Jun 10 '12 at 09:21
  • I think you mean that the headers (tweet, author and local.time) should all move 1 column to the right; if so, you are right. And this is the case for all the csv files. If not, how can I check that? – Gert Jun 10 '12 at 09:27
  • @Sean, first of all thank you very much for your effort!! As you suggested I replaced the read.csv part, for dealing with the headers, this resulted in the following formula: [read.csv('Twitts - di jun 05 00_04_48 2012 .csv', header=FALSE, skip=1, col.names=c('ID','tweet','author','local.time'), colClasses=c('numeric',rep('character', 3)))] This resulted in the following error: [Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'a real', got '"1";"tweet"] This is the first tweet in the csv file, it seems the 1 is still problematic? – Gert Jun 11 '12 at 12:24
  • @Gert, can you post the top few lines of the actual file as text - not as interpreted by a spreadsheet - i.e. open the csv in an editor and post the top few lines (obfuscate the contents of the fields if you like). What are the separator characters between the fields? is it a comma or a semi-colon ";" ? – Sean Jun 11 '12 at 13:39
  • @Gert, separately you could replace the colClasses=c('numeric',rep('character', 3)) with colClasses=rep('character', 4) if you're unsure the first column is purely numeric – Sean Jun 11 '12 at 13:40
  • @Sean, just tested it on 1 file and it seems to be working. Although I got the following warning messages: Warning messages: 1: In print.default(m, ..., quote = quote, right = right) : it is not known that wchar_t is Unicode on this platform 2: In print.default(m, ..., quote = quote, right = right) : it is not known that wchar_t is Unicode on this platform 3: In print.default(m, ..., quote = quote, right = right) : it is not known that wchar_t is Unicode on this platform Do you think this is problematic? – Gert Jun 11 '12 at 14:46
  • @Sean, any suggestions on how to integrate the read.csv part into the first part (meant for merging the files), such that I can apply this to all the files in the folder? Furthermore can I simply use write.table to create the 'master' .csv file with all tweets? – Gert Jun 11 '12 at 15:19
  • @Gert, Given that the speed of reading a file from disk is orders of magnitude slower than calculations, it isn't going to make any noticeable difference to the speed if you just use a for loop, and it will be more readable to boot, so feel free to just loop through all the files using something like for (fnam in filenames) my.df <- rbind(my.df, read.csv(fnam, etc...)) – Sean Jun 11 '12 at 20:49
  • @Gert, those complaints about wchar_t matter if you care about accents on your characters being correctly rendered, and matching correctly. It is possible to use iconv() to persuade them to do the right thing - but that's just too hard from here! I'd need access to the actual files on your machine etc! Make that a separate problem - this "question" has become far too specific and diverse! If you like the answer that I've given please accept, and up-vote! :-) Thanks – Sean Jun 11 '12 at 20:53
  • @Sean, I feel like a real noob at the moment, but I haven't been able to set up the loop such that it reads all .csv files. Hope you can help adjust the code so that it will read all CSV files and afterwards merge them into a new 'master' csv file? The strange thing is that read.csv works fine if I specify a specific file (e.g. Twitts - di jun 05 00_04_48 2012 .csv), but once I try the loop you suggested I get all sorts of errors. I included the code I am talking about in the question. Thanks again for all your help! – Gert Jun 12 '12 at 19:06
  • @Sean, sorry to bother you again.. Really think we are getting close! Edited my question, so you can check the progress made. – Gert Jun 13 '12 at 12:46
  • @Gert, that's about as far as we can go on this problem. I suggest you try and break down any further issues into tiny chunks and solve each individually on toy sets of data, and if you need help come back to SO with new questions. I must emphasise the importance of giving us reproducible data to play with so that the answering process is smoother. Hadley has written a short note on this at https://github.com/hadley/devtools/wiki/Reproducibility – Sean Jun 13 '12 at 14:21