gps - interpolation/lookup in R

- January 15, 2012

I'm switching from Excel to R and am wondering how to do this in R.
I have a dataset that looks like this:

    df1 <- data.frame(zipcode=c("7941ah","7941ag","7941ah","7941az"),
                      from=c(2,30,45,1),
                      to=c(20,38,57,8),
                      type=c("even","mixed","odd","mixed"),
                      gps=c(12345,54321,11221,22331))
    df2 <- data.frame(zipcode=c("7941ah","7941ah","7941ah","7941ag","7941ag","7941az"),
                      housenum=c(18, 19, 50, 32, 104, 11))

The first dataset contains the zipcode, a house number range (from and to), a type indicating whether the range contains even, odd, or mixed house numbers, and the GPS coordinates. The second dataset contains only addresses (zipcode, house number).

What I want to do is look up the GPS coordinates for df2. For example, the address with zipcode 7941ah and house number 18 (an even number between 2 and 20) has GPS coordinate 12345.
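The matching rule described above can be sketched as a small predicate. This is only an illustration: `matches_rule` is a hypothetical helper name, not something from the original post, and it assumes the type column only ever holds "even", "odd", or "mixed".

```r
# Hypothetical helper: does a house number satisfy one df1 rule?
# Vectorised, so it also works on whole columns.
matches_rule <- function(housenum, from, to, type) {
  in_range <- housenum >= from & housenum <= to
  parity_ok <- type == "mixed" |
    (type == "odd" & housenum %% 2 == 1) |
    (type == "even" & housenum %% 2 == 0)
  in_range & parity_ok
}

# House number 18 against the rule 2..20, even
r_even_18 <- matches_rule(18, from = 2, to = 20, type = "even")
# House number 19 is in range but odd, so the same rule rejects it
r_even_19 <- matches_rule(19, from = 2, to = 20, type = "even")
# House number 50 is in the range 45..57 but not odd
r_odd_50 <- matches_rule(50, from = 45, to = 57, type = "odd")
print(c(r_even_18, r_even_19, r_odd_50))
```

An address is looked up by finding the rule with the same zipcode for which this predicate holds.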

Update: It didn't cross my mind that the size of the dataset is important for the chosen solution (I know, a bit naive...). Here is some more information: the actual size of df1 is 472,000 observations, and df2 has 1.1 million observations. The number of unique zipcodes in df1 is 280,000. I stumbled upon a post about speeding up loop operations in R with some interesting findings, but I don't know how to incorporate them in the solution provided by @josilber.

Given your large data frames, your best bet may be to merge df1 and df2 by their zip codes (that is, get every pair of rows from the two data frames that have the same zip code), filter by the house number criteria, remove duplicates (cases where multiple rules from df1 match), and then store the information about the matched houses. Let's start with a sample dataset of the size you indicated:

    set.seed(144)
    df1 <- data.frame(zipcode=sample(1:280000, 472000, replace=TRUE),
                      from=sample(1:50, 472000, replace=TRUE),
                      to=sample(51:100, 472000, replace=TRUE),
                      type=sample(c("even", "odd", "mixed"), 472000, replace=TRUE),
                      gps=sample(1:100, 472000, replace=TRUE))
    df2 <- data.frame(zipcode=sample(1:280000, 1.1e6, replace=TRUE),
                      housenum=sample(1:100, 1.1e6, replace=TRUE))
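As a rough sanity check on why merging by zip code is feasible at this size, the sketch below counts how many row pairs share a zip code and compares that with the number of all possible pairs. It regenerates only zip code vectors of the same sizes (an assumption made to keep the snippet self-contained), so the exact count will differ slightly from a merge of the data frames above.

```r
set.seed(144)
n_zip <- 280000
zip1 <- sample(1:n_zip, 472000, replace = TRUE)  # stands in for df1$zipcode
zip2 <- sample(1:n_zip, 1.1e6, replace = TRUE)   # stands in for df2$zipcode

# Number of rows per zip code in each data set
c1 <- tabulate(zip1, nbins = n_zip)
c2 <- tabulate(zip2, nbins = n_zip)

# A merge on zipcode yields one row per (rule, address) pair sharing a zip code
merged_rows <- sum(as.numeric(c1) * c2)
# Comparing every rule with every address would need this many pairs
all_pairs <- 472000 * 1.1e6

print(merged_rows)
print(all_pairs / merged_rows)
```

With uniformly sampled zip codes the merged table should hold on the order of 472000 * 1.1e6 / 280000, or about 1.85 million rows, which is small enough to filter with vectorised comparisons. Real zip codes are unlikely to be uniformly distributed, so the actual figure may be larger.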

Now we can perform an efficient computation of the GPS data:

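    get.gps <- function(df1, df2) {
      # Add an ID to df2 so matches can be mapped back to its rows
      df2$id <- seq_len(nrow(df2))
      # Pair every df1 rule with every df2 address sharing its zipcode
      m <- merge(df1, df2, by.x="zipcode", by.y="zipcode")
      # Keep pairs whose house number is in range and has the right parity
      m <- m[m$housenum >= m$from &
             m$housenum <= m$to &
             (m$type == "mixed" |
              (m$type == "odd" & m$housenum %% 2 == 1) |
              (m$type == "even" & m$housenum %% 2 == 0)),]
      # Drop addresses matched by more than one rule
      m <- m[!duplicated(m$id) & !duplicated(m$id, fromLast=TRUE),]
      # Addresses without a unique match stay NA
      gps <- rep(NA, nrow(df2))
      gps[m$id] <- m$gps
      return(gps)
    }
    system.time(get.gps(df1, df2))
    #    user  system elapsed
    #  16.197   0.561  17.583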

This is a much more acceptable runtime: 18 seconds instead of the 90 hours you estimated in the comment on the other answer!
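One detail of get.gps worth seeing in isolation is the duplicate-removal step. Combining `duplicated()` from both directions keeps only the ids that occur exactly once, so an address matched by several rules ends up as NA rather than receiving an arbitrarily chosen GPS value. A minimal sketch with made-up ids:

```r
# Made-up ids of matched addresses; 2 and 4 were matched by several rules
ids <- c(1, 2, 2, 3, 4, 4, 4)

# TRUE only where the id appears exactly once in the vector
keep <- !duplicated(ids) & !duplicated(ids, fromLast = TRUE)
unique_ids <- ids[keep]
print(unique_ids)  # 1 3
```

If you would rather keep the first matching rule for ambiguous addresses, `!duplicated(m$id)` on its own would do that instead.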
