python - Interpolate (or extrapolate) only small gaps in pandas dataframe -


i have pandas dataframe time index (1 min freq) , several columns worth of data. data contains nan. if so, want interpolate if gap not longer 5 minutes. in case maximum of 5 consecutive nans. data may (several test cases, show problems):

import numpy np import pandas pd datetime import datetime  start = datetime(2014,2,21,14,50) data = pd.dataframe(index=[start + timedelta(minutes=1*x) x in range(0, 8)],                          data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],                                'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],                                'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4,                                'd': [np.nan]*8,                                'e': [np.nan]*7 + [2330.3],                                'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3],                                'g': [2330.3] + [np.nan]*7,                                'h': [2330.3] + [np.nan]*6 + [2777.7]}) 

it reads this:

in [147]: data out[147]:                                b      c   d       e       f       g       h 2014-02-21 14:50:00  123.5  433.5  123.5 nan     nan     nan  2330.3  2330.3 2014-02-21 14:51:00    nan  523.2  132.3 nan     nan     nan     nan     nan 2014-02-21 14:52:00  136.3  536.3  136.3 nan     nan     nan     nan     nan 2014-02-21 14:53:00  164.3  464.3  164.3 nan     nan     nan     nan     nan 2014-02-21 14:54:00  213.0  413.0    nan nan     nan  2763.0     nan     nan 2014-02-21 14:55:00  164.3  164.3    nan nan     nan  2142.3     nan     nan 2014-02-21 14:56:00  213.0  213.0    nan nan     nan  2127.3     nan     nan 2014-02-21 14:57:00  221.1  221.1    nan nan  2330.3  2330.3     nan  2777.7 

i aware of data.interpolate() has several flaws, produces result, columns a-e, columns f-h fails different reasons::

                              b      c   d       e       f       g  \ 2014-02-21 14:50:00  123.5  433.5  123.5 nan     nan     nan  2330.3    2014-02-21 14:51:00  129.9  523.2  132.3 nan     nan     nan  2330.3    2014-02-21 14:52:00  136.3  536.3  136.3 nan     nan     nan  2330.3    2014-02-21 14:53:00  164.3  464.3  164.3 nan     nan     nan  2330.3    2014-02-21 14:54:00  213.0  413.0  164.3 nan     nan  2763.0  2330.3    2014-02-21 14:55:00  164.3  164.3  164.3 nan     nan  2142.3  2330.3    2014-02-21 14:56:00  213.0  213.0  164.3 nan     nan  2127.3  2330.3    2014-02-21 14:57:00  221.1  221.1  164.3 nan  2330.3  2330.3  2330.3                                    h   2014-02-21 14:50:00  2330.300000   2014-02-21 14:51:00  2394.214286   2014-02-21 14:52:00  2458.128571   2014-02-21 14:53:00  2522.042857   2014-02-21 14:54:00  2585.957143   2014-02-21 14:55:00  2649.871429   2014-02-21 14:56:00  2713.785714   2014-02-21 14:57:00  2777.700000  

f) gap consists of 4 minutes worth of nans in beginning, should replaced value 2763.0 (i.e. extrapolating backwards in time)

g) gap longer 5 minutes still gets extrapolated

h) gap longer 5 minutes still gap interpolated.

i understand reasons, of course specified should not interpolate longer gaps 5 minutes. understand interpolate extrapolates forward in time, want extrapolate backward in time. there known methods can use problem, without reinventing wheel?

edit: method data.interpolate accepts input parameter limit, defines maximum number of consecutive nans substituted interpolation. still interpolates limit, want go on nans in case.

so here mask ought solve problem. interpolate , apply mask reset appropriate values nan. honestly, bit more work realized because had loop through each column groupby didn't quite work without me providing dummy columns 'ones'.

anyway, can explain if unclear couple of lines hard understand. see here little bit more of explanation of trick on df['new'] line or print out individual lines better see going on.

mask = data.copy() in list('abcdefgh'):     df = pd.dataframe( data[i] )     df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())     df['ones'] = 1     mask[i] = (df.groupby('new')['ones'].transform('count') < 5) | data[i].notnull()  in [7]: data out[7]:                                b      c   d       e       f       g       h 2014-02-21 14:50:00  123.5  433.5  123.5 nan     nan     nan  2330.3  2330.3 2014-02-21 14:51:00    nan  523.2  132.3 nan     nan     nan     nan     nan 2014-02-21 14:52:00  136.3  536.3  136.3 nan     nan     nan     nan     nan 2014-02-21 14:53:00  164.3  464.3  164.3 nan     nan     nan     nan     nan 2014-02-21 14:54:00  213.0  413.0    nan nan     nan  2763.0     nan     nan 2014-02-21 14:55:00  164.3  164.3    nan nan     nan  2142.3     nan     nan 2014-02-21 14:56:00  213.0  213.0    nan nan     nan  2127.3     nan     nan 2014-02-21 14:57:00  221.1  221.1    nan nan  2330.3  2330.3     nan  2777.7  in [8]: mask out[8]:                              b     c      d      e     f      g      h 2014-02-21 14:50:00  true  true  true  false  false  true   true   true 2014-02-21 14:51:00  true  true  true  false  false  true  false  false 2014-02-21 14:52:00  true  true  true  false  false  true  false  false 2014-02-21 14:53:00  true  true  true  false  false  true  false  false 2014-02-21 14:54:00  true  true  true  false  false  true  false  false 2014-02-21 14:55:00  true  true  true  false  false  true  false  false 2014-02-21 14:56:00  true  true  true  false  false  true  false  false 2014-02-21 14:57:00  true  true  true  false   true  true  false   true 

it's easy there if don't fancier respect extrapolation:

in [9]: data.interpolate().bfill()[mask] out[9]:                                b      c   d       e       f       g       h 2014-02-21 14:50:00  123.5  433.5  123.5 nan     nan  2763.0  2330.3  2330.3 2014-02-21 14:51:00  129.9  523.2  132.3 nan     nan  2763.0     nan     nan 2014-02-21 14:52:00  136.3  536.3  136.3 nan     nan  2763.0     nan     nan 2014-02-21 14:53:00  164.3  464.3  164.3 nan     nan  2763.0     nan     nan 2014-02-21 14:54:00  213.0  413.0  164.3 nan     nan  2763.0     nan     nan 2014-02-21 14:55:00  164.3  164.3  164.3 nan     nan  2142.3     nan     nan 2014-02-21 14:56:00  213.0  213.0  164.3 nan     nan  2127.3     nan     nan 2014-02-21 14:57:00  221.1  221.1  164.3 nan  2330.3  2330.3     nan  2777.7 

edit add: here's faster (about 2x on sample data) , simpler way, moving stuff outside of loop:

mask = data.copy() grp = ((mask.notnull() != mask.shift().notnull()).cumsum()) grp['ones'] = 1 in list('abcdefgh'):     mask[i] = (grp.groupby(i)['ones'].transform('count') < 5) | data[i].notnull() 

Comments

Popular posts from this blog

Java 3D LWJGL collision -

spring - SubProtocolWebSocketHandler - No handlers -

methods - python can't use function in submodule -