
I am dealing with a very messy data set from Nexis (a bunch of articles with titles, dates, authors, etc.):

             V1        V2          V3       V4     V5        V6        V7            V8
1         1.    UNIONS UNIMPRESSED       BY GEORGE OSBORNE'S  SPENDING ANNOUNCEMENTS
2            PA Newswire:   Scotland, November    25,      2015 Wednesday          1:54
3     Newswire: Scotland,        1567   words,   Alan    Jones,     Press   Association
4 Correspondent                                                                        
5            2.  Standard        Life       to   back      HSBC      over            HQ
6           The    Herald  (Glasgow), November    24,      2015  Tuesday,           Pg.
          V9  V10    V11
1                       
2         PM BST,     PA
3 Industrial            
4                       
5       move            
6        23,  620 words,

I want to count how many articles appear per month in each year (1995-2015). Although the head of the data shows the month appearing in the same column, this is not always the case. Nevertheless, I have noticed that the year always appears two columns to the right of the month (in the same row). So I want to write code that finds how many articles are from November 1995, February 1995, ..., October 2015. Anyone up to the challenge?

Kind regards

PS: in the following image one can see better the data:

enter image description here

Economist_Ayahuasca

1 Answer


As you provided no working example, I created one and hope that my code will also work on your data.

# build example
d <- data.frame(a=c(month.name[1:3],month.name[1]),b=c(letters[1:4]),c=c(70,99,15,14))
d <- apply(d, 2, as.character)
d

Now the code loops over all columns, searching every row for one of the 12 month names. For each matching row it extracts the month and the year (two columns to the right), pastes them together and saves them in the result.

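# Loop over every column that still has a column two positions to its right
# (avoids a "subscript out of bounds" error for a month in the last two columns)
result <- NULL
month_pattern <- paste(month.name, collapse = "|")
for (i in seq_len(max(ncol(d) - 2, 0))) {
  # row ids whose cell in column i contains one of the 12 month names
  row <- grep(month_pattern, d[, i])
  if (length(row) > 0) {
    col <- i  # column in which the month was found
    # paste the month and the year (two columns to the right) together
    mpy <- paste(d[row, i], d[row, i + 2], sep = "_")
    tmp <- data.frame(col, row, mpy, row.names = NULL)
    result <- rbind(result, tmp)
  }
}
# count the articles per month and year
table(result$mpy)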
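
The same idea can also be sketched without an explicit loop. The following is only a rough sketch, assuming the data is a character matrix in which every cell holds a single token, as in the printout in the question. The toy matrix m and the names clean, hits, month, year and counts are made up for illustration. It matches cells that are exactly a month name after stripping punctuation, so a word such as "Mayor" is not counted as "May", and it keeps only hits whose cell two columns to the right looks like a four-digit year.

```r
# Hypothetical toy matrix mimicking the layout in the question:
# the month can sit in any column, the year is two columns to its right.
m <- rbind(
  c("PA", "Newswire:", "Scotland,", "November", "25,", "2015", "Wednesday"),
  c("The", "Herald", "(Glasgow),", "November", "24,", "2015", "Tuesday,"),
  c("Financial", "Times,", "February", "3,", "1995,", "Pg.", "12"),
  c("UNIONS", "UNIMPRESSED", "BY", "GEORGE", "OSBORNE'S", "SPENDING", "")
)

# Strip punctuation such as trailing commas; clean[] keeps the matrix shape
clean <- m
clean[] <- gsub("[[:punct:]]", "", m)

# Row/column positions of cells that are exactly one of the 12 month names
is_month <- matrix(clean %in% month.name, nrow = nrow(clean))
hits <- which(is_month, arr.ind = TRUE)

# Drop hits that have no column two positions to the right
hits <- hits[hits[, "col"] + 2 <= ncol(clean), , drop = FALSE]

# Look up the month and the cell two columns to its right
month <- clean[hits]
year  <- clean[cbind(hits[, "row"], hits[, "col"] + 2)]

# Keep only pairs where the second cell is a four-digit year, then count
ok <- grepl("^[0-9]{4}$", year)
counts <- table(paste(month[ok], year[ok], sep = "_"))
counts
```

On this toy matrix, counts contains February_1995 once and November_2015 twice. Whether exact matching is preferable to the grep approach above depends on how the real cells look, for example whether a month name ever shares a cell with other text.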
Roman
  • Hi Jimbou, thanks very much for the answer. You give a really detailed and good explanation. Only one thing: in the month per year part, when you mention length(gr) > 0, what is gr? – Economist_Ayahuasca Nov 27 '15 at 12:50
  • I just edited the code. It must be the row vector instead. – Roman Nov 27 '15 at 12:52