I have attached a screenshot of a dataset from a cattle market. The Tag no column gives the unique identity of each animal. The CONTACT_NO SKP column holds phone numbers and is used to identify each visitor; many rows share the same CONTACT_NO SKP, which shows that the same person owns many animals. SDATE SKP gives the date of each visitor's transaction in the market, and Distance KM gives the distance from the cattle market to the visitor's place of origin.

The BREED SKP, GENDER SKP, COLOUR SKP, AWEIGHT SKP and AGE SKP columns all give details of animal quality, while SALE_PURPOSE SKP gives an idea of the reason for the trade. All other variables can be ignored. I don't understand how to proceed, e.g.:

(1) How do you suggest I go about segmenting the visitors and combining them with respect to the relevant variables? For instance, I want to group all rows that share the same phone number and treat them as one person, summarising for that person the number of animals, the purpose of their trade, the type of animal and the distance travelled (i.e. across all variables), and then compare this person against all the other visitors to this cattle market.

[screenshot of the cattle market dataset]

M--
FMP
  • Please use `dput` to output a **small** portion of your dataset -- don't post pictures. Also, what have you tried? We see no effort on your behalf. – blacksite May 16 '17 at 19:15
  • I don't know how to start the analysis....I can't put the data into the format from where I can work. – FMP May 16 '17 at 20:06
  • I have been editing and merging the data for the past two days...the snippet that I posted was the end result by making it a lot simpler – FMP May 16 '17 at 20:07

1 Answer

Depending on what you want to do, here are a few things that come to mind. This is caveman code; there are much more sophisticated tools out there for dataset manipulation.

Let's say you have a dataset that looks like this:

df <- data.frame(
  phone    = c("555-1234", "555-6789", "555-1111", "555-1234", "555-1234"),
  breed    = c("holstien", "hereford", NA, "holstien", "holstien"),
  price    = c(200, 300, NA, 300, 400),
  distance = c(10, 20, 30, 10, 10)
)

df
#      phone    breed price distance
# 1 555-1234 holstien   200       10
# 2 555-6789 hereford   300       20
# 3 555-1111     <NA>    NA       30
# 4 555-1234 holstien   300       10
# 5 555-1234 holstien   400       10

A few summaries by individual:

with(df, table(phone, breed))  # number of each breed for each person
#           breed
#   phone      hereford holstien
#   555-1111        0        0
#   555-1234        0        3
#   555-6789        1        0

with(df, tapply(price, phone, mean))  # average amount spent by each person
# 555-1111 555-1234 555-6789 
#       NA      300      300

with(df, tapply(price, phone, sum))  # total amount spent by each person
# 555-1111 555-1234 555-6789 
#       NA      900      300

with(df, tapply(distance, phone, min))  # distance for each person (I cheated a little)
# 555-1111 555-1234 555-6789 
#       30       10       20

These can then be put together into a new data.frame:

unique_phone <- with(df, sort(unique(phone)))
avg_amount <- with(df, tapply(price, phone, mean))
tot_amount <- with(df, tapply(price, phone, sum))
dist <- with(df, tapply(distance, phone, min))
df_pp <- data.frame(unique_phone, avg_amount, tot_amount, dist)

df_pp  # note that this could be cleaner, but the info is there
#          unique_phone avg_amount tot_amount dist
# 555-1111     555-1111         NA         NA   30
# 555-1234     555-1234        300        900   10
# 555-6789     555-6789        300        300   20
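
The question also asks for the number of animals per person, which the summaries above don't cover. Here is a rough base R sketch of one way to get everything in a single pass: split the data by phone number and build one row per visitor. The helper name `summarise_visitor` and the columns `n_animals` and `n_breeds` are made up for illustration, and the toy `df` is redefined so the snippet runs on its own.

```r
# Toy data, same as the example above.
df <- data.frame(
  phone    = c("555-1234", "555-6789", "555-1111", "555-1234", "555-1234"),
  breed    = c("holstien", "hereford", NA, "holstien", "holstien"),
  price    = c(200, 300, NA, 300, 400),
  distance = c(10, 20, 30, 10, 10)
)

# Hypothetical helper: collapse all rows of one visitor into a single row.
summarise_visitor <- function(d) {
  data.frame(
    phone      = as.character(d$phone[1]),
    n_animals  = nrow(d),                           # one row per animal
    tot_amount = sum(d$price),                      # NA if any price is missing
    avg_amount = mean(d$price),                     # add na.rm = TRUE to skip NAs
    dist       = min(d$distance),                   # assumes one origin per visitor
    n_breeds   = length(unique(na.omit(d$breed))),  # distinct non-missing breeds
    stringsAsFactors = FALSE
  )
}

# Split by phone number, summarise each piece, and stack the results.
per_person <- do.call(rbind, lapply(split(df, df$phone), summarise_visitor))
rownames(per_person) <- NULL

per_person
```

Each visitor ends up as one row, so the result can be compared across visitors or fed into whatever segmentation you have in mind. More columns (for example the sale purpose) can be added inside the helper in the same way.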

There are much cleaner ways of doing this, and hopefully someone who knows the dplyr package and its friends better than I do can weigh in. I'm hoping this gives you enough of a skeleton to get what you need; it can of course be added to.
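
For what it's worth, a dplyr version of the same per-person table might look roughly like the sketch below. It assumes dplyr (version 1.0.0 or later, for the `.groups` argument) is installed, and it reuses the toy `df` from above; the column name `n_animals` and the object `breed_counts` are my own additions.

```r
library(dplyr)

# Toy data, same as the example above.
df <- data.frame(
  phone    = c("555-1234", "555-6789", "555-1111", "555-1234", "555-1234"),
  breed    = c("holstien", "hereford", NA, "holstien", "holstien"),
  price    = c(200, 300, NA, 300, 400),
  distance = c(10, 20, 30, 10, 10)
)

# One row per phone number with the same summaries as df_pp.
df_pp <- df %>%
  group_by(phone) %>%
  summarise(
    n_animals  = n(),            # number of rows (animals) per person
    avg_amount = mean(price),    # NA if any price is missing
    tot_amount = sum(price),
    dist       = min(distance),
    .groups    = "drop"
  )

# Breed counts in long format: one row per observed phone/breed pair.
breed_counts <- df %>% count(phone, breed)

df_pp
breed_counts
```

The long format from `count()` only stores combinations that actually occur in the data, rather than a full phone-by-breed grid.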

Matt Tyers
  • Thanks Matt. It'll take a while for me to understand what's going on and try it on my data, but I think this should give me a start. I will work on it and get back to you as soon as I have something to show. I really appreciate your help; I had no idea where to start. – FMP May 16 '17 at 21:47
  • Hey Matt, I tried your formula but it says `Error: cannot allocate vector of size 1.4 Gb` – FMP May 16 '17 at 23:42
  • But I get the idea. However, I have about 35000 observations and 19 variables in my set, so I guess it can't take such big data. – FMP May 16 '17 at 23:42
  • Ugh. Maybe turn `x <- with(df, tapply(thing, stuff, function))` into `x <- unname(with(df, tapply(thing, stuff, function)))`. Maybe not having to store names will use less memory? – Matt Tyers May 16 '17 at 23:50