2
  • I have sales data from walmart like superstore.

  • It has 5 variables: ProductId, Category, Mode of Operations, Sales, Profit Margin

  • Out of these 5 variables two are unique i.e ProductId and Category. The other 3 can change its value for each month.

  • The data is for 23 months.

So, if I had one product, life would be simple, and it would be a 2 dimensional data. But as my number of products are over 20,000, I am a little confused how to tackle this data so I can run linear regression, cart and other functions on it.

Edit1:example

read.table(text="
A B C1 D1 E1 C2 D2 E2 C3 D3 E3 C4 D4 E4 C5 D5 E5 C6 D6 E6 

1 K  X  5  5  X  6  7  Y  2  1  Z  0  0  X  6  7  X  7  9

2 L  Y  5  4  X  6  9  Z  2  3  Z  0  0  X  6  6  X  7 10

3 M  X  5  5  Z  6  7  X  2  1  Y  0  0  Y  6  7  Y  7  9",
header=TRUE)

C1,D1,E1 are values for 1st month, C2,D2,E2 are values for second month and so on.

I would expect it to become

A B C D E MONTH

1 K X 5 5 1

1 K X 6 7 2

1 K Y 2 1 3

1 K Z 0 0 4

1 K X 6 7 5

1 K X 7 9 6

2 L Y 5 4 1

2 L X 6 9 2

. . .

structure of raw data

enter image description here

structure of data after applying melt, transform and cast function

enter image description here

Which is not something I was expecting.

Under A should have 4 levels (i.e Joint Venture, Reseller, Distribution, 0)

Under B should have prices

Under C should have profit margin which in my data is always in decimal places 0.xxx

what am I doing wrong?

edit 3

dcast function is not picking values from value column to assign to variables A,B and C. Where it is getting all these 1s to put under A, B and C?

> names(wushang_transformed2)

[1] "Productcode" "Category"    "CAT"         "Month"       "value"     

> wushang_casted = dcast(wushang_transformed2, Productcode+Category+Month~CAT, value.var="value")

What's wrong with this command?

Community
  • 1
  • 1
Jay khan
  • 745
  • 2
  • 9
  • 22
  • Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and the expected output. – Molx Aug 11 '15 at 04:04
  • Did it, i hope it can clarify the problem. thanks – Jay khan Aug 11 '15 at 04:58
  • Instead of adding a picture of your data, can you edit your post and cut and paste the data in, use the result of: `dput(head(wushang[,1:12]))` – N8TRO Aug 11 '15 at 16:46

1 Answers1

2
# Read in the data
read.table(text="
A B C1 D1 E1 C2 D2 E2 C3 D3 E3 C4 D4 E4 C5 D5 E5 C6 D6 E6 

1 K  X  5  5  X  6  7  Y  2  1  Z  0  0  X  6  7  X  7  9

2 L  Y  5  4  X  6  9  Z  2  3  Z  0  0  X  6  6  X  7 10

3 M  X  5  5  Z  6  7  X  2  1  Y  0  0  Y  6  7  Y  7  9",
header=TRUE) -> my_df

#Load required library
library(reshape2)

# melt brings columns not in the id.vars list into variable and value
#    where variable is the original column name 

my_df_melted <- melt(my_df, id.vars=c("A", "B"))

# transform adds new columns, subtr() splits "C1" into "C" and "1"
# the [,-3] at the end drops the 3rd column

my_df_transformed <- transform(my_df_melted, CAT=substr(variable, 1, 1), MONTH=substr(variable, 2, 2))[,-3]

# dcast keeps A, B, and MONTH columns and pushes CAT into new columns
# [,c(1:2,4:6,3)] reorders the columns

my_df_casted <- dcast(my_df_transformed, A+B+MONTH~CAT)[,c(1:2,4:6,3)]

   A B C D  E MONTH
1  1 K X 5  5     1
2  1 K X 6  7     2
3  1 K Y 2  1     3
4  1 K Z 0  0     4
5  1 K X 6  7     5
6  1 K X 7  9     6
7  2 L Y 5  4     1
8  2 L X 6  9     2
9  2 L Z 2  3     3
10 2 L Z 0  0     4
11 2 L X 6  6     5
12 2 L X 7 10     6
13 3 M X 5  5     1
14 3 M Z 6  7     2
15 3 M X 2  1     3
16 3 M Y 0  0     4
17 3 M Y 6  7     5
18 3 M Y 7  9     6

Note: in case of error "Aggregation function missing: defaulting to length", either remove the duplicates by 'unique' function or add aggregate argument.

Jay khan
  • 745
  • 2
  • 9
  • 22
N8TRO
  • 3,348
  • 3
  • 22
  • 40
  • 1
    I'd recommend including the packages required for your answer. – BenBarnes Aug 11 '15 at 06:51
  • 1
    @BenBarnes And I get SO mad when people do that to me. I'm sorry. Already edited. – N8TRO Aug 11 '15 at 06:54
  • 1
    Can you please add some comments with the functions that you have used? it will help me understand what happened so i can then edit it while applying on the real dataset. Thanks – Jay khan Aug 11 '15 at 06:56
  • @Jaykhan done. I had to use these functions a few time to understand a little of what they did. try `help(melt)`, `help(transform)`, `help(substr)`, and `help(dcast)`. Check this as answered if that's what it did. – N8TRO Aug 11 '15 at 07:08
  • @N8TRO i tried what you said and this how data looks like, can you please look and see what i am doing wrong? thanks – Jay khan Aug 11 '15 at 08:20
  • @N8TRO i looked at csv files of melted, transformed and casted data frame separately and it seems like only dcast function is having some problem. – Jay khan Aug 11 '15 at 09:04
  • @N8TRO its not picking up values from the value column to fill up the three variable columns. i tried using the argument value.var="value" but nothing happened. – Jay khan Aug 11 '15 at 09:24