I'm trying to extract each company's data from some free NASDAQ 100 Twitter data sets available here. The ultimate goal, once constructed and curated, is to run some modeling experiments with the dataframe. The basic dataframe form I'm aiming for is:
ATVI 49.02 0.44 0.91 7193022 .3
ADBE 119.91 0.31 0.26 1984225 .1
AKAM 64.2 0.65 1.02 1336622 .1
ALXN 126.55 0.86 0.67 2182253 .2
GOOG 838.68 3.31 0.4 1261517 1.0
AMZN 853 2.5 0.29 2048187 1.0
For each company, there are six .xlsx files (unzipped into separate directories), each excel file with multiple worksheets inside. For right now, I'm just trying to extract the first worksheet from each of the six excels spreadsheets for each company. All of those worksheets have two columns with varying numbers of rows and the data labels are on different rows, e.g. file 1, company 1:
Keyword $AAPL -
Total tweets 166631
Total audience 221363515
Contributors 42738
Original tweets 91614
Replies 4964
RTs 70053
Images and links 43361
file 2, company 1:
Keyword $AAPL -
Total audience 221363515
Contributors 42738
Total tweets 166631
Total potential impressions 1.250.920.501
Measured data from 2016-04-02 18:06
Measured data to 2016-06-15 12:23
Tweets per contributor 3,90
Impressions / Audience 5,65
Measured time in seconds 6373058
Measured time in minutes 106218
Measured time in hours 1770
Measured time in days 74
Tweets per second 0.026146161
Tweets per minute 1.568769655
Tweets per hour 94.1261793
Tweets per day 2259.028303
I'm trying to implement readxl
as suggested in this post and then put each companies data into a row of a dataframe [below]. Right now, I'm setting the first path as my directory and then running the code, then setting the second path and running it again to add the new row (I know this isn't optimal, see below).
library(readxl)
#create empty dataframe to assemble all the rows
cdf <- data.frame()
#setwd('...\\NASDAQ_100\\aal_2016_06_15_12_01_41')
#setwd('...\\NASDAQ_100\\aapl_2016_06_15_14_30_09')
#constructing list of all .xlsx files in current directory
file.list <- list.files(pattern='*.xlsx')
#using read_excel function to read each file in list and put in a dataframe of lists
df.list <- lapply(file.list, read_excel)
#converting the dataframe of lists to a 77x2 dataframe
df <- as.data.frame(do.call(rbind, df.list),stringsAsFactors=FALSE)
#transposing the dataframe to prepare to stack multiple companies data in single dataframe
df <- t(df)
#making sure that the dataframe entry values are numeric
df <- transform(df,as.numeric)
#appending the 2nd row with the actual data into the dataframe that will have all companies' data
cdf <- rbind(cdf,df[2,])
Sample output:
> cdf[,1:8]
X1 X2 X3 X4 X5 X6 X7 X8
$AAL 6507 14432722 1645 5211 459 837 938 14432722
$AAPL - 166631 221363515 42738 91614 4964 70053 43361 221363515
Upon inspection, I'm finding that there are levels in my columns which I've gathered from various other posts is because of how I imported the data and is why I tried adding stringsAsFactors=FALSE
to the as.data.frame
, but clearly that's not the solution:
> cdf[,2]
$AAL $AAPL -
14432722 221363515
Levels: 14432722 Total audience 221363515
According to the documentation, that's not an argument for read_excel
. Is there a way to still use it, but avoid these levels?
Once I've sorted this out, I'm hoping to get this in a basic for loop to go through all the unzipped sub-directories:
dir.list <- list.dirs(recursive = F)
for (subdir in dir.list) {
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)
df <- as.data.frame(do.call(rbind, df.list),stringsAsFactors=FALSE)
df <- t(df)
df <- transform(df,as.numeric)
cdf <- rbind(cdf,df[2,])
}
But this yields > cdf data frame with 0 columns and 0 rows
? I know that none of the code is elegant or compact (& that rbind is ill-advised in for loops), but it's what I've been able to piece together. I'm very receptive to style corrections and alternate methods, but it'd be very much appreciated if their context were explained within the overall problem/solution described here (i.e.: not just, "use package xyz" or "read ldply()'s documentation").
Thanks,