0

I have two excel files, I want to perform vlookup and find difference of costs using python or even excel.

My files look like this

source_data.xlsx contains contains distance covered and their price, example distance range from 1 to 100 should be charged 4800 and distance range from 101 to 120 should be charged 5100.

DISTANCE     COST    
1-100        4800    
101-120      5100   
121-140      5500   
141-160      5900 
161-180      6200    
181-200      6600    
210-220      6900    
221-240      7200

Analysis.xlsx

loading_station  distance_travel     total_cost    status    
PUGU                  40                4000       PAID       
PUGU                  80                3200       PAID  
MOROGORO              50                5000       PAID
MOROGORO              220               30400      PAID
DODOMA                150               5100       PAID
KIGOMA                90                2345       PAID
DODOMA                230               6000       PAID
DODOMA                180               16500      PAID
KIGOMA                32                3000       PAID
DODOMA                45                6000       PAID
DODOMA                65                5000       PAID
KIGOMA                77                1000       PAID
KIGOMA                90                4000       PAID

Actual Cost for distance is given in source_data.xlsx, I want to check cost in Analysis.xlsx if it correspond to Actual value, I want to detect underpayment and overpayment.

Desired Output should be like this, with two column added, source_cost which is taken from source_xlsx by using vlookup and difference which is difference between total_cost and source_cost

loading_station distance_travel total_cost  status  source_cost  Difference
PUGU               40                4000     PAID     4800        -800
PUGU               80                3200     PAID     4800        -1600
MOROGORO           50                5000     PAID     4800        200
MOROGORO           220               30400    PAID     6900        23500
DODOMA             150               5100     PAID     5900        -800
KIGOMA             90                2345     PAID     4800        -2455
DODOMA             230               6000     PAID     7200        -1200
DODOMA             180               16500    PAID     6200        10300
KIGOMA             32                3000     PAID     4800        -1800
DODOMA             45                6000     PAID     4800        1200
DODOMA             65                5000     PAID     4800        200
KIGOMA             77                1000     PAID     4800        -3800
KIGOMA             90                4000     PAID     4800        -800

My code so far

library(readxl)
analysis <- read_excel("~/analysis.xlsx")
View(analysis)
source_data <- read_excel("~/source_data.xlsx")
View(source_data)
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
tony michael
  • 137
  • 5

1 Answers1

2

This is somewhat dependent on what format your data is in after being read from Excel, but if it is in the format implied by your question, then you can do:

analysis$source_cost <- source_data$COST[
  findInterval(analysis$distance_travel,
               sapply(strsplit(source_data$DISTANCE, '-'), 
                               function(x) as.numeric(x[1])))]

analysis$Difference <- analysis$total_cost - analysis$source_cost

analysis
#>    loading_station distance_travel total_cost status source_cost Difference
#> 1             PUGU              40       4000   PAID        4800       -800
#> 2             PUGU              80       3200   PAID        4800      -1600
#> 3         MOROGORO              50       5000   PAID        4800        200
#> 4         MOROGORO             220      30400   PAID        6900      23500
#> 5           DODOMA             150       5100   PAID        5900       -800
#> 6           KIGOMA              90       2345   PAID        4800      -2455
#> 7           DODOMA             230       6000   PAID        7200      -1200
#> 8           DODOMA             180      16500   PAID        6200      10300
#> 9           KIGOMA              32       3000   PAID        4800      -1800
#> 10          DODOMA              45       6000   PAID        4800       1200
#> 11          DODOMA              65       5000   PAID        4800        200
#> 12          KIGOMA              77       1000   PAID        4800      -3800
#> 13          KIGOMA              90       4000   PAID        4800       -800

Data from question in reproducible format

source_data <- structure(list(DISTANCE = c("1-100", "101-120", "121-140", 
"141-160", "161-180", "181-200", "210-220", "221-240"), COST = c(4800L, 
5100L, 5500L, 5900L, 6200L, 6600L, 6900L, 7200L)), 
class = "data.frame", row.names = c(NA, -8L))

analysis <- structure(list(loading_station = c("PUGU", "PUGU", "MOROGORO", 
"MOROGORO", "DODOMA", "KIGOMA", "DODOMA", "DODOMA", "KIGOMA", 
"DODOMA", "DODOMA", "KIGOMA", "KIGOMA"), distance_travel = c(40L, 
80L, 50L, 220L, 150L, 90L, 230L, 180L, 32L, 45L, 65L, 77L, 90L
), total_cost = c(4000L, 3200L, 5000L, 30400L, 5100L, 2345L, 
6000L, 16500L, 3000L, 6000L, 5000L, 1000L, 4000L), status = c("PAID", 
"PAID", "PAID", "PAID", "PAID", "PAID", "PAID", "PAID", "PAID", 
"PAID", "PAID", "PAID", "PAID")), class = "data.frame", row.names = c(NA, 
-13L))

Created on 2022-08-14 by the reprex package (v2.0.1)

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • 2
    Nice! Addding the base function [**`findInterval`**](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/findInterval) to my toolkit. – Parfait Aug 14 '22 at 15:25
  • 2
    @Parfait thanks. I'm a latecomer to `findInterval` myself, and must have written a (worse) algorithm to do the same job many times over until it was pointed out to me... – Allan Cameron Aug 14 '22 at 15:28