python - Interpolate (or extrapolate) only small gaps in pandas dataframe -
i have pandas dataframe time index (1 min freq) , several columns worth of data. data contains nan. if so, want interpolate if gap not longer 5 minutes. in case maximum of 5 consecutive nans. data may (several test cases, show problems):
import numpy np import pandas pd datetime import datetime start = datetime(2014,2,21,14,50) data = pd.dataframe(index=[start + timedelta(minutes=1*x) x in range(0, 8)], data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1], 'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1], 'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4, 'd': [np.nan]*8, 'e': [np.nan]*7 + [2330.3], 'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3], 'g': [2330.3] + [np.nan]*7, 'h': [2330.3] + [np.nan]*6 + [2777.7]})
it reads this:
in [147]: data out[147]: b c d e f g h 2014-02-21 14:50:00 123.5 433.5 123.5 nan nan nan 2330.3 2330.3 2014-02-21 14:51:00 nan 523.2 132.3 nan nan nan nan nan 2014-02-21 14:52:00 136.3 536.3 136.3 nan nan nan nan nan 2014-02-21 14:53:00 164.3 464.3 164.3 nan nan nan nan nan 2014-02-21 14:54:00 213.0 413.0 nan nan nan 2763.0 nan nan 2014-02-21 14:55:00 164.3 164.3 nan nan nan 2142.3 nan nan 2014-02-21 14:56:00 213.0 213.0 nan nan nan 2127.3 nan nan 2014-02-21 14:57:00 221.1 221.1 nan nan 2330.3 2330.3 nan 2777.7
i aware of data.interpolate()
has several flaws, produces result, columns a-e, columns f-h fails different reasons::
b c d e f g \ 2014-02-21 14:50:00 123.5 433.5 123.5 nan nan nan 2330.3 2014-02-21 14:51:00 129.9 523.2 132.3 nan nan nan 2330.3 2014-02-21 14:52:00 136.3 536.3 136.3 nan nan nan 2330.3 2014-02-21 14:53:00 164.3 464.3 164.3 nan nan nan 2330.3 2014-02-21 14:54:00 213.0 413.0 164.3 nan nan 2763.0 2330.3 2014-02-21 14:55:00 164.3 164.3 164.3 nan nan 2142.3 2330.3 2014-02-21 14:56:00 213.0 213.0 164.3 nan nan 2127.3 2330.3 2014-02-21 14:57:00 221.1 221.1 164.3 nan 2330.3 2330.3 2330.3 h 2014-02-21 14:50:00 2330.300000 2014-02-21 14:51:00 2394.214286 2014-02-21 14:52:00 2458.128571 2014-02-21 14:53:00 2522.042857 2014-02-21 14:54:00 2585.957143 2014-02-21 14:55:00 2649.871429 2014-02-21 14:56:00 2713.785714 2014-02-21 14:57:00 2777.700000
f) gap consists of 4 minutes worth of nans in beginning, should replaced value 2763.0 (i.e. extrapolating backwards in time)
g) gap longer 5 minutes still gets extrapolated
h) gap longer 5 minutes still gap interpolated.
i understand reasons, of course specified should not interpolate longer gaps 5 minutes. understand interpolate
extrapolates forward in time, want extrapolate backward in time. there known methods can use problem, without reinventing wheel?
edit: method data.interpolate
accepts input parameter limit
, defines maximum number of consecutive nans substituted interpolation. still interpolates limit, want go on nans in case.
so here mask ought solve problem. interpolate
, apply mask reset appropriate values nan. honestly, bit more work realized because had loop through each column groupby didn't quite work without me providing dummy columns 'ones'.
anyway, can explain if unclear couple of lines hard understand. see here little bit more of explanation of trick on df['new']
line or print out individual lines better see going on.
mask = data.copy() in list('abcdefgh'): df = pd.dataframe( data[i] ) df['new'] = ((df.notnull() != df.shift().notnull()).cumsum()) df['ones'] = 1 mask[i] = (df.groupby('new')['ones'].transform('count') < 5) | data[i].notnull() in [7]: data out[7]: b c d e f g h 2014-02-21 14:50:00 123.5 433.5 123.5 nan nan nan 2330.3 2330.3 2014-02-21 14:51:00 nan 523.2 132.3 nan nan nan nan nan 2014-02-21 14:52:00 136.3 536.3 136.3 nan nan nan nan nan 2014-02-21 14:53:00 164.3 464.3 164.3 nan nan nan nan nan 2014-02-21 14:54:00 213.0 413.0 nan nan nan 2763.0 nan nan 2014-02-21 14:55:00 164.3 164.3 nan nan nan 2142.3 nan nan 2014-02-21 14:56:00 213.0 213.0 nan nan nan 2127.3 nan nan 2014-02-21 14:57:00 221.1 221.1 nan nan 2330.3 2330.3 nan 2777.7 in [8]: mask out[8]: b c d e f g h 2014-02-21 14:50:00 true true true false false true true true 2014-02-21 14:51:00 true true true false false true false false 2014-02-21 14:52:00 true true true false false true false false 2014-02-21 14:53:00 true true true false false true false false 2014-02-21 14:54:00 true true true false false true false false 2014-02-21 14:55:00 true true true false false true false false 2014-02-21 14:56:00 true true true false false true false false 2014-02-21 14:57:00 true true true false true true false true
it's easy there if don't fancier respect extrapolation:
in [9]: data.interpolate().bfill()[mask] out[9]: b c d e f g h 2014-02-21 14:50:00 123.5 433.5 123.5 nan nan 2763.0 2330.3 2330.3 2014-02-21 14:51:00 129.9 523.2 132.3 nan nan 2763.0 nan nan 2014-02-21 14:52:00 136.3 536.3 136.3 nan nan 2763.0 nan nan 2014-02-21 14:53:00 164.3 464.3 164.3 nan nan 2763.0 nan nan 2014-02-21 14:54:00 213.0 413.0 164.3 nan nan 2763.0 nan nan 2014-02-21 14:55:00 164.3 164.3 164.3 nan nan 2142.3 nan nan 2014-02-21 14:56:00 213.0 213.0 164.3 nan nan 2127.3 nan nan 2014-02-21 14:57:00 221.1 221.1 164.3 nan 2330.3 2330.3 nan 2777.7
edit add: here's faster (about 2x on sample data) , simpler way, moving stuff outside of loop:
mask = data.copy() grp = ((mask.notnull() != mask.shift().notnull()).cumsum()) grp['ones'] = 1 in list('abcdefgh'): mask[i] = (grp.groupby(i)['ones'].transform('count') < 5) | data[i].notnull()
Comments
Post a Comment