Python pandas timeseries resample giving unexpected results -


the data here bank account running balance. want resample data use end of day balance, last value given day. there can multiple data points day, representing multiple transactions.

in [1]: stringio import stringio  in [2]: import pandas pd  in [3]: import numpy np  in [4]: print "pandas version", pd.__version__ pandas version 0.12.0  in [5]: print "numpy version", np.__version__ numpy version 1.7.1  in [6]: data_string = stringio(""""date","balance"    ...: "08/09/2013","1000"    ...: "08/09/2013","950"    ...: "08/09/2013","930"    ...: "08/06/2013","910"    ...: "08/02/2013","900"    ...: "08/01/2013","88"    ...: "08/01/2013","87"    ...: """)  in [7]: ts = pd.read_csv(data_string, parse_dates=[0], index_col=0)  in [8]: print ts             balance date                2013-08-09     1000 2013-08-09      950 2013-08-09      930 2013-08-06      910 2013-08-02      900 2013-08-01       88 2013-08-01       87 

i expect "2013-08-09" 1000, not 'middle' number 950.

in [10]: ts.balance.resample('d', how='last') out[10]:  date 2013-08-01     88 2013-08-02    900 2013-08-03    nan 2013-08-04    nan 2013-08-05    nan 2013-08-06    910 2013-08-07    nan 2013-08-08    nan 2013-08-09    950 freq: d, dtype: float64 

i expect "2013-08-09" 930, or "2013-08-01" 88.

in [12]: ts.balance.resample('d', how='first') out[12]:  date 2013-08-01      87 2013-08-02     900 2013-08-03     nan 2013-08-04     nan 2013-08-05     nan 2013-08-06     910 2013-08-07     nan 2013-08-08     nan 2013-08-09    1000 freq: d, dtype: float64 

am missing here? resampling 'first' , 'last' not work way i'm expecting to?

to able resample data pandas first have sort it. if load data , sort index following thing:

>>> pd.read_csv(data_string, parse_dates=[0], index_col=0).sort_index()             balance date                2013-08-01       87 2013-08-01       88 2013-08-02      900 2013-08-06      910 2013-08-09     1000 2013-08-09      930 2013-08-09      950 

which explains why got results got. @jeff explained why order "arbitrary" , according comment solution use mergesort algorithm on data before operations...

>>> df = pd.read_csv(data_string, parse_dates=[0],                      index_col=0).sort_index(kind='mergesort') >>> df.balance.resample('d',how='last') 2013-08-01      88 2013-08-02     900 2013-08-03     nan 2013-08-04     nan 2013-08-05     nan 2013-08-06     910 2013-08-07     nan 2013-08-08     nan 2013-08-09    1000 >>> df.balance.resample('d', how='first') 2013-08-01     87 2013-08-02    900 2013-08-03    nan 2013-08-04    nan 2013-08-05    nan 2013-08-06    910 2013-08-07    nan 2013-08-08    nan 2013-08-09    930 

Comments

Popular posts from this blog

java - Run a .jar on Heroku -

java - Jtable duplicate Rows -

validation - How to pass paramaters like unix into windows batch file -