First, sorry for the long introduction, but I think it will help in understanding the problem. I am working on a project where we are trying to use a large floating car data set to infer human mobility patterns, and I am using RStudio to do so. We have two files. The first, trips.csv, contains 375,000 trips with metadata such as trip ID, start/end location (longitude, latitude) and other fields. The second, waypoints.csv, contains the full GPS waypoint data, listed trip by trip, including waypoint sequence, location and other fields.

In total, there are nearly 10 million waypoints (second file) for these 375,000 trips (first file). Each trip in the first file has a number of waypoints in the second file that together form the trajectory of that trip. The following tables show samples from each file with only the columns relevant to my problem:

Trip Data

TripId,Lon1,Lat1,Lon2,Lat2,distance
bb983d,11.565,48.19,11.55,48.143,7498
da5bgg,11.584,48.157,11.639,48.098,1364
saefeg,11.591,48.142,11.563,48.18,7377

Way Point Data

TripId,sequence,Lon,Lat
bb983d,0,11.565,48.19
bb983d,1,11.56688,48.18158
bb983d,2,11.56351,48.18144
bb983d,3,11.56335,48.1888
bb983d,4,11.5654,48.17617
da5bgg,0,11.584,48.157
da5bgg,1,11.583417,48.155167
da5bgg,2,11.578472,48.144556
da5bgg,3,11.57075,48.142139
5aefeg,0,11.591,48.142
5aefeg,1,11.58994,48.13956
5aefeg,2,11.58797,48.13706

Here is the code I used to make the data frames:

# Output of dput(droplevels(head(trips)))
trips <- structure(list(
  TripId = structure(1:6, .Label = c("00a7da9f4b503f36fc937f386b11ca58", "00aa3cb70345798d9b1d92bc4685b3ee", "017cb0697a1135c5cd3479c1edc2aa6b", "01cc30aa0e036817cf4bdc468c9fad8a", "01f0a6a90ec964ae8014d2f750231663", "02949197deca3f1d52906cfc147454c5"), class = "factor"),
  StartLocLat = c(48.178, 48.098, 48.15, 48.176, 48.149, 48.151),
  startLocLon = c(11.573, 11.501, 11.503, 11.558, 11.503, 11.563),
  EndLocLat = c(48.143, 48.098, 48.18, 48.168, 48.148, 48.127),
  EndLocLon = c(11.55, 11.639, 11.563, 11.526, 11.616, 11.554)
), row.names = c(NA, 6L), class = "data.frame")

# Output of dput(droplevels(head(waypoints)))
waypoints <- structure(list(
  TripId = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("00a7da9f4b503f36fc937f386b11ca58", "00aa3cb70345798d9b1d92bc4685b3ee"), class = "factor"),
  Sequence = c(0L, 1L, 2L, 3L, 4L, 0L),
  Latitude = c(48.178, 48.18158, 48.18144, 48.1808, 48.17617, 48.098),
  Longitude = c(11.573, 11.56688, 11.56351, 11.56335, 11.5654, 11.501)
), row.names = c(NA, 6L), class = "data.frame")

Now, I would like to add a column, deviation area, that represents the area between a virtual straight line from the start point to the end point of each trip and the actual trajectory obtained by connecting that trip's waypoints (in sequence order) with line segments.

The attached image may help in understanding the area in question: the desired area to be calculated

I did some quick research but didn't find exactly what I need, especially since I need to do this for all trips.

Any hints or suggestions, with code if possible, would be very much appreciated!

tcratius
Rayan
  • Welcome to SO! Please add minimal example and wanted output: [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – pogibas Sep 01 '18 at 14:58
  • In particular, please do not provide your data as an image. For us to use it would require that we type it all in. Instead, cut and paste your data into your question so that we can just copy it to use it. – G5W Sep 01 '18 at 15:50
  • Thank you both for your advice. This is my first post, sorry for that. Edits are done now! – Rayan Sep 01 '18 at 16:20

1 Answer

This is how I would approach it, though it could be wrong.

To properly calculate the distance between two longitude/latitude points you would normally use the Haversine formula. However, that is a fairly complex equation, which I would guess is why the distance values have been provided.
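For reference, the Haversine formula is short once written as code. The following is only a minimal sketch in base R: the function name `haversine_m` is made up for illustration, and the mean Earth radius of 6,371 km is an assumption.

```r
# Great-circle (Haversine) distance in metres between two points given in
# decimal degrees. The default radius (mean Earth radius) is an assumption.
haversine_m <- function(lon1, lat1, lon2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Straight-line distance between the start and end of trip bb983d
d_bb983d <- haversine_m(11.565, 48.19, 11.55, 48.143)
```

This gives the straight-line distance between start and end, so it should come out shorter than a distance measured along the route actually driven.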

We can calculate the distance between two data points (x1, y1) and (x2, y2) using the Euclidean distance formula, which follows from Pythagoras' theorem (I never thought I would say that after leaving school). It works out as:

distance = square_root((x2 - x1)^2 + (y2 - y1)^2)

The differences are squared first because longitude and latitude differences can be negative, and squaring makes each term non-negative. Negative longitude and latitude values denote direction, i.e. west of the prime meridian and south of the equator. Points x and y on a plot can also have negative values; not in your scenario, but it is good practice to think ahead.

Now take the two data files provided above, place each in a text file and save them somewhere handy. In the console, install the dplyr package to run my code.
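As a quick check of the formula, here is a small sketch in base R. The helper name `euclid` is made up for illustration, and the result is in degrees, not metres.

```r
# Euclidean distance between (x1, y1) and (x2, y2)
euclid <- function(x1, y1, x2, y2) sqrt((x2 - x1)^2 + (y2 - y1)^2)

# Trip bb983d, start to end: about 0.0493 (in degrees)
d_forward <- euclid(11.565, 48.19, 11.55, 48.143)

# Swapping the points flips the sign of both differences,
# but squaring removes the sign, so the distance is unchanged
d_backward <- euclid(11.55, 48.143, 11.565, 48.19)
```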

install.packages("dplyr")

From there you can use this code:

# Read CSV data with a header row; as.is = TRUE keeps the current data
# types (strings are not converted to factors).
# Load the dplyr package for use in the current session.
# choose.files() lets the user pick the file(s) needed. It is Windows only;
# file.choose() is the cross-platform equivalent.
library(dplyr)
read.virtual.line <- read.csv(choose.files(), header = TRUE, as.is = TRUE)
read.waypoints <- read.csv(choose.files(), header = TRUE)

# Convert the files read into data frames and assign them to variable names.
df.virtual.line <- data.frame(read.virtual.line)
df.waypoints <- data.frame(read.waypoints)

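# This piece of code is executed from the right of the <- first.
# Calculate the Euclidean distance and assign it to dist_scale.
# mutate makes a new column called dist_scale with the result of the above
# calculation.
# (The call itself is missing from the post; this reconstruction reproduces
# the dist_scale values in the table below.)
df.virtual.line <- df.virtual.line %>%
  mutate(dist_scale = sqrt((Lon2 - Lon1)^2 + (Lat2 - Lat1)^2))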

New Column dist_scale

TripId  Lon1    Lat1    Lon2    Lat2    distance    dist_scale
bb983d  11.565  48.19   11.55   48.143  7498    0.049335586
da5bgg  11.584  48.157  11.639  48.098  13643   0.080659779
saefeg  11.591  48.142  11.563  48.18   7377    0.047201695

Looking at the first value of dist_scale: the starting point is at 0 and the end point at 0.049335586. Note that this value is in degrees, not in the units of the distance column.

The rest you will have to try yourself. The way I would look at it is like this:

    1. You now have one line segment.
    2. Get all line segments and you have a polygon.
      • a. If y2 - y1 = 0 then you have reached the virtual line.
      • b. Stop and go to 3.
    3. Convert all dist_scale values to the same scale as the distance variable.
    4. Calculate the area of the polygon using the new_distance values, which represent one line segment per data point/value.
    5. Repeat until all values have been evaluated and all areas computed.
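One possible way to turn these steps into code is the shoelace formula for the area of a polygon. It is a swapped-in technique, not something the steps above spell out. The sketch below is in base R and rests on several assumptions: the function name `deviation_area_m2` is made up, the waypoints are ordered by sequence, the last waypoint of a trip is its end point (otherwise append the trip's end coordinates), and the trip is short enough that a flat local projection to metres is acceptable.

```r
# Area (square metres) enclosed by a trajectory and the straight line from
# its first to its last point, via the shoelace formula.
# Assumption: a flat equirectangular projection is accurate enough locally.
deviation_area_m2 <- function(lon, lat, radius = 6371000) {
  to_rad <- pi / 180
  # Project degrees to metres relative to the first waypoint
  x <- (lon - lon[1]) * to_rad * radius * cos(mean(lat) * to_rad)
  y <- (lat - lat[1]) * to_rad * radius
  # Closing the path back to the first point adds the virtual straight line
  x_next <- c(x[-1], x[1])
  y_next <- c(y[-1], y[1])
  abs(sum(x * y_next - x_next * y)) / 2
}

# Sample waypoints copied from the question
waypoints <- data.frame(
  TripId = c(rep("bb983d", 5), rep("da5bgg", 4)),
  Sequence = c(0:4, 0:3),
  Longitude = c(11.565, 11.56688, 11.56351, 11.56335, 11.5654,
                11.584, 11.583417, 11.578472, 11.57075),
  Latitude = c(48.19, 48.18158, 48.18144, 48.1888, 48.17617,
               48.157, 48.155167, 48.144556, 48.142139)
)

# Sort by trip and sequence, then compute one area per trip
waypoints <- waypoints[order(waypoints$TripId, waypoints$Sequence), ]
areas <- sapply(split(waypoints, waypoints$TripId),
                function(w) deviation_area_m2(w$Longitude, w$Latitude))
```

Be aware that the shoelace formula returns a net area: if the trajectory crosses the straight line, the parts on opposite sides cancel each other out. If you need the total area on both sides, you would have to split the path at the crossing points first. For 10 million waypoints, a grouped summary in dplyr or data.table may be faster than split() with sapply().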

I recommend using problem decomposition to define how your code will run before writing it, i.e. steps like those above. If you are having trouble writing the code, at least try to write down the steps you will take to reach the solution. Break them up into chunks and post them here, and a Stack Overflow user will be able to help you. Don't forget to add the code that you tried.

If you get an error message while writing and running the code, please search the internet before posting it here. There are a lot of answers out there, and you will likely find that your problem is not unique. Typing R in front of the error message in any search engine will likely give you the help you need: "R Error message".

Best of luck and hope this has helped.

tcratius