I am working on a way to split up data in a CSV file based on a timestamp.

For example, for a given object ID, check each entry's date and see whether it falls within a given, allowed range. So if a set of rows in the table were:

OBJECT ID      Info        Date
obj1           xyz         1/1/12
obj1           xyw         1/2/12
obj1           cya         1/3/12
obj1           abc         2/1/12
...

In this example, the fourth entry falls well outside the time range covered by the other entries. My desired behavior is therefore for a script to assign that entry to a new object, say 'obj2', so that it is separated from the data in the original cluster. Note that the dataset this will be applied to is fairly large, at least tens of thousands of rows, so I don't know whether a hand-rolled algorithm will be fast enough.
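
To make the desired behavior concrete, the following base R sketch shows one simple way to express it. It is not a clustering method: it just starts a new group whenever the gap to the previous entry of the same object exceeds a threshold. The data frame `df`, its column names, the month/day/year date format, the 7-day `max_gap_days` threshold and the `_` delimiter are all assumptions made for illustration.

```r
# Hypothetical sample data mirroring the rows above (dates assumed to be month/day/year)
df <- data.frame(
  id   = c("obj1", "obj1", "obj1", "obj1"),
  info = c("xyz", "xyw", "cya", "abc"),
  date = c("1/1/12", "1/2/12", "1/3/12", "2/1/12"),
  stringsAsFactors = FALSE
)
df$date <- as.Date(df$date, format = "%m/%d/%y")

max_gap_days <- 7  # assumed threshold; tune for the real data

# Sort by id and date so that consecutive rows are neighbors in time
df <- df[order(df$id, df$date), ]

# Within each id, start a new group whenever the gap to the
# previous row exceeds the threshold
df$group <- ave(as.numeric(df$date), df$id,
                FUN = function(d) cumsum(c(0, diff(d)) > max_gap_days) + 1)

# Build the new object ID from the original ID and the group number
df$new_id <- paste(df$id, df$group, sep = "_")
print(df)
```

With these sample rows, the first three entries end up as `obj1_1` and the fourth as `obj1_2`. Everything here is vectorized per object, so it should cope with tens of thousands of rows, though that has not been measured on real data.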

I'm using R for the moment to try to get this done, with the pam function from the cluster package and the pamk function from the fpc package. This gives me a plot of the clusters (I think), but I don't know how to apply the cluster assignments back to the actual data.

Any thoughts or ideas on the best way to do this?

Brian Tompsett - 汤莱恩
The Whether Man

1 Answer

I figured out a solution using the following steps:

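library(fpc)  # provides pamk()

# Convert the timestamps to POSIXct, then to numeric seconds since the epoch
data$time <- as.numeric(as.POSIXct(data$date, format="date_format_here"))

# Split the data using the object ID as the parameter
splitData <- split(data, f=data$id)

# Iterate over the split groups, cluster each group's timestamps, and
# append the cluster number to the object ID using paste
for (i in seq_along(splitData)) {
    pamk.result <- pamk(splitData[[i]]["time"])
    splitData[[i]]$id <- paste(splitData[[i]]$id,
                               pamk.result$pamobject$clustering,
                               sep="delimiter_here")
}

# Recombine the groups into a single data frame
newData <- do.call(rbind, splitData)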

Anyway, this is a rough outline of how I approached the problem. Maybe this will give some ideas to others down the line.
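
For a self-contained illustration of the last step, mapping cluster assignments back onto the rows, here is a sketch on the four sample rows from the question. It calls pam from the cluster package directly with a fixed k = 2 instead of pamk, because pamk searches over a range of cluster counts and, with its default range, needs more observations than this toy group has. The fixed k, the column names, the month/day/year date format and the `_` delimiter are assumptions for the example, not part of the original approach.

```r
library(cluster)  # provides pam()

# Hypothetical sample data taken from the question's table
df <- data.frame(
  id   = rep("obj1", 4),
  date = c("1/1/12", "1/2/12", "1/3/12", "2/1/12"),
  stringsAsFactors = FALSE
)

# Numeric seconds since the epoch, so the timestamps can be clustered
df$time <- as.numeric(as.POSIXct(df$date, format = "%m/%d/%y", tz = "UTC"))

# Cluster each object's timestamps separately
groups <- split(df, f = df$id)
for (i in seq_along(groups)) {
  fit <- pam(groups[[i]]["time"], k = 2)  # k = 2 is assumed, not estimated
  # fit$clustering holds one cluster number per row, in row order
  groups[[i]]$new_id <- paste(groups[[i]]$id, fit$clustering, sep = "_")
}

# Recombine the groups into a single data frame
result <- do.call(rbind, groups)
print(result[, c("id", "date", "new_id")])
```

The key point is that the clustering vector is aligned with the rows passed to pam, so it can be pasted onto the ID column directly. On real data, the number of clusters per object would have to be chosen or estimated rather than fixed.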

The Whether Man