Python pandas timeseries resample giving unexpected results -
the data here bank account running balance. want resample data use end of day balance, last value given day. there can multiple data points day, representing multiple transactions.
in [1]: stringio import stringio in [2]: import pandas pd in [3]: import numpy np in [4]: print "pandas version", pd.__version__ pandas version 0.12.0 in [5]: print "numpy version", np.__version__ numpy version 1.7.1 in [6]: data_string = stringio(""""date","balance" ...: "08/09/2013","1000" ...: "08/09/2013","950" ...: "08/09/2013","930" ...: "08/06/2013","910" ...: "08/02/2013","900" ...: "08/01/2013","88" ...: "08/01/2013","87" ...: """) in [7]: ts = pd.read_csv(data_string, parse_dates=[0], index_col=0) in [8]: print ts balance date 2013-08-09 1000 2013-08-09 950 2013-08-09 930 2013-08-06 910 2013-08-02 900 2013-08-01 88 2013-08-01 87
i expect "2013-08-09" 1000, not 'middle' number 950.
in [10]: ts.balance.resample('d', how='last') out[10]: date 2013-08-01 88 2013-08-02 900 2013-08-03 nan 2013-08-04 nan 2013-08-05 nan 2013-08-06 910 2013-08-07 nan 2013-08-08 nan 2013-08-09 950 freq: d, dtype: float64
i expect "2013-08-09" 930, or "2013-08-01" 88.
in [12]: ts.balance.resample('d', how='first') out[12]: date 2013-08-01 87 2013-08-02 900 2013-08-03 nan 2013-08-04 nan 2013-08-05 nan 2013-08-06 910 2013-08-07 nan 2013-08-08 nan 2013-08-09 1000 freq: d, dtype: float64
am missing here? resampling 'first' , 'last' not work way i'm expecting to?
to able resample data pandas first have sort it. if load data , sort index following thing:
>>> pd.read_csv(data_string, parse_dates=[0], index_col=0).sort_index() balance date 2013-08-01 87 2013-08-01 88 2013-08-02 900 2013-08-06 910 2013-08-09 1000 2013-08-09 930 2013-08-09 950
which explains why got results got. @jeff explained why order "arbitrary" , according comment solution use mergesort
algorithm on data before operations...
>>> df = pd.read_csv(data_string, parse_dates=[0], index_col=0).sort_index(kind='mergesort') >>> df.balance.resample('d',how='last') 2013-08-01 88 2013-08-02 900 2013-08-03 nan 2013-08-04 nan 2013-08-05 nan 2013-08-06 910 2013-08-07 nan 2013-08-08 nan 2013-08-09 1000 >>> df.balance.resample('d', how='first') 2013-08-01 87 2013-08-02 900 2013-08-03 nan 2013-08-04 nan 2013-08-05 nan 2013-08-06 910 2013-08-07 nan 2013-08-08 nan 2013-08-09 930
Comments
Post a Comment