gps - Interpolation/lookup in R
I'm switching from Excel to R and wondering how to do this in R.
I have a dataset that looks like this:
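    # Range rules: one row per zipcode and house number range
    df1 <- data.frame(zipcode = c("7941ah", "7941ag", "7941ah", "7941az"),
                      from = c(2, 30, 45, 1),
                      to = c(20, 38, 57, 8),
                      type = c("even", "mixed", "odd", "mixed"),
                      gps = c(12345, 54321, 11221, 22331))
    # Addresses to look up (zipcodes written as 7941.. so they match df1)
    df2 <- data.frame(zipcode = c("7941ah", "7941ah", "7941ah",
                                  "7941ag", "7941ag", "7941az"),
                      housenum = c(18, 19, 50, 32, 104, 11))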
The first dataset contains the zipcode, the house number range (from and to), the type (whether the range contains even, odd, or mixed house numbers), and the GPS coordinates. The second dataset contains only addresses (zipcode and house number).
What I want is to look up the GPS coordinates for df2. For example, the address with zipcode 7941ah and house number 18 (an even number between 2 and 20) has GPS coordinate 12345.
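To make the matching rule concrete, here is a minimal sketch of it as a vectorized predicate. The helper name `matches_rule` is hypothetical and not part of the original post; it only restates the criteria described above (house number inside the range, and parity compatible with the type).

```r
# Hypothetical helper (not from the original post): TRUE when a house number
# satisfies one range rule (from, to, type) as described in the question.
matches_rule <- function(housenum, from, to, type) {
  # The house number must lie inside the inclusive range [from, to]
  in_range <- housenum >= from & housenum <= to
  # "mixed" accepts any parity; "even"/"odd" require the matching parity
  parity_ok <- type == "mixed" |
    (type == "even" & housenum %% 2 == 0) |
    (type == "odd" & housenum %% 2 == 1)
  in_range & parity_ok
}

matches_rule(18, from = 2, to = 20, type = "even")    # TRUE
matches_rule(19, from = 2, to = 20, type = "even")    # FALSE: odd number, even-only range
matches_rule(32, from = 30, to = 38, type = "mixed")  # TRUE
```

Because every operation is vectorized, the same function can be applied to whole columns at once rather than row by row.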
Update: It didn't cross my mind that the size of the dataset is important for the chosen solution (I know, a bit naive...). Here is some more information: the actual size of df1 is 472,000 observations and df2 has 1.1 million observations. The number of unique zipcodes in df1 is 280,000. I stumbled upon a post about speeding up loop operations in R with some interesting findings, but I don't know how to incorporate them into the solution provided by @josilber.
Given your large data frames, your best bet may be to merge df1 and df2 by their zip codes (that is, get every pair of rows from the two data frames that share a zip code), filter by the house number criteria, remove duplicates (cases where multiple rules from df1 match), and then store the information for the matched houses. Let's start with a sample dataset of the size you indicated:
    set.seed(144)
    df1 <- data.frame(zipcode = sample(1:280000, 472000, replace = TRUE),
                      from = sample(1:50, 472000, replace = TRUE),
                      to = sample(51:100, 472000, replace = TRUE),
                      type = sample(c("even", "odd", "mixed"), 472000, replace = TRUE),
                      gps = sample(1:100, 472000, replace = TRUE))
    df2 <- data.frame(zipcode = sample(1:280000, 1.1e6, replace = TRUE),
                      housenum = sample(1:100, 1.1e6, replace = TRUE))
Now we can perform an efficient computation of the GPS data:
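    get.gps <- function(df1, df2) {
      # Add an ID to df2 so matches can be mapped back to its rows
      df2$id <- seq_len(nrow(df2))
      # Pair every address with every rule that shares its zipcode
      m <- merge(df1, df2, by.x = "zipcode", by.y = "zipcode")
      # Keep pairs where the house number is in range and has the right parity
      m <- m[m$housenum >= m$from & m$housenum <= m$to &
               (m$type == "mixed" |
                  (m$type == "odd" & m$housenum %% 2 == 1) |
                  (m$type == "even" & m$housenum %% 2 == 0)), ]
      # Drop every address that matched more than one rule
      m <- m[!duplicated(m$id) & !duplicated(m$id, fromLast = TRUE), ]
      # Unmatched addresses stay NA
      gps <- rep(NA, nrow(df2))
      gps[m$id] <- m$gps
      return(gps)
    }
    system.time(get.gps(df1, df2))
    #    user  system elapsed
    #  16.197   0.561  17.583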
This is a much more acceptable runtime: 18 seconds instead of the 90 hours estimated in a comment on the other answer!
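As a quick sanity check, the same merge-then-filter idea can be run on the small example from the question. This is only a sketch: it assumes the zip codes in df2 were meant to match the 7941.. codes in df1, and it repeats the filtering steps inline so the snippet runs on its own.

```r
# Small example from the question. Assumption: df2's zip codes are written
# as 7941.. so that they match the codes used in df1.
df1 <- data.frame(zipcode = c("7941ah", "7941ag", "7941ah", "7941az"),
                  from = c(2, 30, 45, 1), to = c(20, 38, 57, 8),
                  type = c("even", "mixed", "odd", "mixed"),
                  gps = c(12345, 54321, 11221, 22331))
df2 <- data.frame(zipcode = c("7941ah", "7941ah", "7941ah",
                              "7941ag", "7941ag", "7941az"),
                  housenum = c(18, 19, 50, 32, 104, 11))

# Same steps as get.gps: tag rows, merge on zipcode, filter, drop ambiguous IDs
df2$id <- seq_len(nrow(df2))
m <- merge(df1, df2, by = "zipcode")
keep <- m$housenum >= m$from & m$housenum <= m$to &
  (m$type == "mixed" |
     (m$type == "odd" & m$housenum %% 2 == 1) |
     (m$type == "even" & m$housenum %% 2 == 0))
m <- m[keep, ]
m <- m[!duplicated(m$id) & !duplicated(m$id, fromLast = TRUE), ]

# Addresses without a matching rule keep NA
df2$gps <- NA
df2$gps[m$id] <- m$gps
print(df2[, c("zipcode", "housenum", "gps")])
```

Under that assumption, house number 18 at 7941ah gets 12345 and house number 32 at 7941ag gets 54321, while the other four addresses stay NA because no rule covers them (for instance, 50 is even but the 45 to 57 range only holds odd numbers).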