I have a CSV file containing a table with the columns "id", "timestamp", "action", "value" and "location". I want to apply a function to each row of the table, and I have already written the code in R as follows:

user <- read.csv(file_path,sep = ";")
num <- nrow(user)
curLocation <- "1"
for(i in 1:num) {
    row <- user[i,]
    if(row$action != "power")
        curLocation <- row$value
    user[i,"location"] <- curLocation
}

The R script works fine, and now I want to do the same in SparkR. However, I couldn't access the ith row directly in SparkR, and I couldn't find any function in the SparkR documentation for manipulating every row.

Which method should I use in order to achieve the same effect as in the R script?

In addition, as advised by @chateaur, I tried using the dapply function as follows:

curLocation <- "1"
schema <- structType(structField("Sequence","integer"), structField("ID","integer"), structField("Timestamp","timestamp"), structField("Action","string"), structField("Value","string"), structField("Location","string"))
setLocation <- function(row, curLoc) {
    if(row$Action != "power|battery|level"){
        curLoc <- row$Value
    }
    row$Location <- curLoc
}
bw <- dapply(user, function(row) { setLocation(row, curLocation)}, schema)
head(bw)

Then I got an error: error message

I looked up the warning message "the condition has length > 1 and only the first element will be used" and found this answer: https://stackoverflow.com/a/29969702/4942713. It made me wonder whether the row parameter in the dapply function represents an entire partition of my data frame instead of a single row. Maybe the dapply function is not a suitable solution?
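
One way to check this without a cluster is to mimic the call in plain R. The sketch below is a local simulation under stated assumptions, not SparkR itself: `apply_per_chunk` is a hypothetical helper that splits a local data.frame into chunks and calls the function once per chunk, which is how the dapply documentation describes the real call (the function is applied to each partition and receives it as an R data.frame). The sample data is made up for illustration.

```r
# Hypothetical stand-in for dapply: split a local data.frame into
# n_chunks contiguous pieces and call f once per piece (not once per row).
apply_per_chunk <- function(df, f, n_chunks = 2) {
  chunk_id <- sort(rep(seq_len(n_chunks), length.out = nrow(df)))
  lapply(split(df, chunk_id), f)
}

# Made-up sample data using the column names from the schema above.
user <- data.frame(
  Action = c("power", "move", "power", "move"),
  Value  = c("a", "2", "b", "3"),
  stringsAsFactors = FALSE
)

# Report what the function is handed: the number of rows it received
# and the length of the condition it would pass to if().
seen <- apply_per_chunk(user, function(row) {
  c(rows = nrow(row), condition_length = length(row$Action != "power"))
})
print(seen)
```

In this simulation every chunk reports more than one row, so the comparison inside `if()` yields one logical value per row rather than a single value. That is the situation the warning describes; from R 4.2.0 onwards the same condition is an error rather than a warning.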

Later, I tried to modify the function as advised by @chateaur. Instead of using dapply, I used dapplyCollect which saves me the effort of specifying the schema. It works!

changeLocation <- function(partitionnedDf) {
    nrows <- nrow(partitionnedDf)
    curLocation <- "1"
    for(i in 1:nrows){
        row <- partitionnedDf[i,]
        if(row$action != "power") {
            curLocation <- row$value
        }
        partitionnedDf[i,"location"] <- curLocation
    }
    partitionnedDf
}

bw <- dapplyCollect(user, changeLocation)

1 Answer

Scorpion775,

You should share your SparkR code. Keep in mind that data is not manipulated the same way in R and SparkR.

From http://spark.apache.org/docs/latest/sparkr.html:

df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")

Then you can look at the dapply function here: https://spark.apache.org/docs/2.1.0/api/R/dapply.html

Here is a working example:

changeLocation <- function(partitionnedDf) {
    nrows <- nrow(partitionnedDf)
    curLocation <- 1L

    # Loop over each row of the partitioned data frame
    # (seq_len avoids iterating when a partition is empty)
    for(i in seq_len(nrows)){
        row <- partitionnedDf[i,]

        # Column 1 is the action, column 2 the value
        if(row[[1]] != "power") {
            curLocation <- row[[2]]
        }
        partitionnedDf[i,3] <- curLocation
    }

    # Return the modified data frame
    partitionnedDf
}

# Load data
df <- read.df("data.csv", "csv", header="false", inferSchema = "true")

head(collect(df))

# Define schema of dataframe
schema <- structType(structField("action", "string"), structField("value", "integer"),
                     structField("location", "integer"))

# Change location of each row                    
df2 <- dapply(df, changeLocation, schema)

head(df2)
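
The loop inside changeLocation can also be written without iterating row by row. The following is a minimal sketch in base R, assuming the same three columns as the schema above (action, value, location) and no missing values in action. The name `fillLocation` and the sample data are hypothetical, and the function has only been reasoned through on a local data.frame, not run under SparkR.

```r
# Hypothetical loop-free variant of changeLocation.
# Assumes columns action, value, location and no NA in action.
fillLocation <- function(partitionnedDf, start = 1L) {
  # TRUE on rows that carry a new location (every action except "power")
  isUpdate <- partitionnedDf$action != "power"
  # Number of updates seen up to and including each row
  updatesSoFar <- cumsum(isUpdate)
  # Position 1 holds the starting location, position k + 1 the k-th update
  candidates <- c(start, partitionnedDf$value[isUpdate])
  partitionnedDf$location <- candidates[updatesSoFar + 1L]
  partitionnedDf
}

# Made-up local data for illustration.
localDf <- data.frame(
  action   = c("power", "move", "power", "move", "power"),
  value    = c(10L, 2L, 20L, 3L, 30L),
  location = NA_integer_,
  stringsAsFactors = FALSE
)
result <- fillLocation(localDf)
print(result$location)
```

Because it takes a data frame and returns a data frame with the same columns, it could presumably be passed to dapply in place of changeLocation, for example `dapply(df, fillLocation, schema)`, provided the column names match these assumptions.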
chateaur
  • I took a look at the dapply function and found out that it is used for "Apply a **function** to each partition of a SparkDataFrame". From my understanding, the notion _partition_ has nothing to do with _row_. My concern is, I don't know how to write the **function** to be applied to the SparkDataFrame. Currently I only know how to implement the **function** I want in R but not in SparkR. Could you give me some advice? – Scorpion775 Feb 14 '17 at 09:00
  • I am not a Spark expert, but I think partitions are pieces of the data that are split up and spread over the cluster. Could you try the above example and tell me if it suits your needs? – chateaur Feb 14 '17 at 18:06
  • Thank you for the advice. I tried to follow your instruction but got an error as shown in the question. – Scorpion775 Feb 15 '17 at 08:51
  • I edited my post; try it and let me know how it goes :) My earlier mistake was to think that the function passed to dapply receives rows. In fact it receives a data frame. I believe that Spark will cut the data frame, send each part to a different node and apply the function (here changeLocation). Could anyone confirm? – chateaur Feb 15 '17 at 21:28
  • It works as long as I use the **dapplyCollect** function instead, in which case I don't need to specify the schema. – Scorpion775 Feb 17 '17 at 07:18
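
On the point raised in the comments about how the data frame is cut up: the SparkR documentation describes both dapply and dapplyCollect as applying the function to each partition. One consequence worth checking is that curLocation starts again from its initial value in every partition. The sketch below simulates this in plain R. `runPerChunk` is a hypothetical helper, the sample data is made up, and the contiguous chunking is an assumption for illustration, not a description of how Spark actually assigns rows to partitions.

```r
# changeLocation as in the question, written against column names.
changeLocation <- function(partitionnedDf) {
  curLocation <- "1"
  for (i in seq_len(nrow(partitionnedDf))) {
    if (partitionnedDf$action[i] != "power") {
      curLocation <- partitionnedDf$value[i]
    }
    partitionnedDf[i, "location"] <- curLocation
  }
  partitionnedDf
}

# Hypothetical helper: cut df into contiguous chunks, apply f to each,
# and bind the results back together (a stand-in for partitions).
runPerChunk <- function(df, f, nChunks) {
  chunkId <- sort(rep(seq_len(nChunks), length.out = nrow(df)))
  pieces <- lapply(split(df, chunkId), f)
  do.call(rbind, unname(pieces))
}

# Made-up data: one location update followed by three "power" rows.
user <- data.frame(
  action   = c("move", "power", "power", "power"),
  value    = c("7", "a", "b", "c"),
  location = NA_character_,
  stringsAsFactors = FALSE
)

onePass   <- changeLocation(user)$location
twoChunks <- runPerChunk(user, changeLocation, nChunks = 2)$location
print(onePass)    # location carried across all four rows
print(twoChunks)  # second chunk starts again from the initial "1"
```

Since each row here depends on state carried over from earlier rows, the per-chunk result matches the plain R loop only when no chunk boundary falls inside a run of "power" rows. It may be worth comparing the SparkR output against the original script on a small sample before relying on it.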