Python pandas timeseries resample giving unexpected results -

- September 15, 2014

the data here bank account running balance. want resample data use end of day balance, last value given day. there can multiple data points day, representing multiple transactions.

in [1]: stringio import stringio  in [2]: import pandas pd  in [3]: import numpy np  in [4]: print "pandas version", pd.__version__ pandas version 0.12.0  in [5]: print "numpy version", np.__version__ numpy version 1.7.1  in [6]: data_string = stringio(""""date","balance"    ...: "08/09/2013","1000"    ...: "08/09/2013","950"    ...: "08/09/2013","930"    ...: "08/06/2013","910"    ...: "08/02/2013","900"    ...: "08/01/2013","88"    ...: "08/01/2013","87"    ...: """)  in [7]: ts = pd.read_csv(data_string, parse_dates=[0], index_col=0)  in [8]: print ts             balance date                2013-08-09     1000 2013-08-09      950 2013-08-09      930 2013-08-06      910 2013-08-02      900 2013-08-01       88 2013-08-01       87

i expect "2013-08-09" 1000, not 'middle' number 950.

in [10]: ts.balance.resample('d', how='last') out[10]:  date 2013-08-01     88 2013-08-02    900 2013-08-03    nan 2013-08-04    nan 2013-08-05    nan 2013-08-06    910 2013-08-07    nan 2013-08-08    nan 2013-08-09    950 freq: d, dtype: float64

i expect "2013-08-09" 930, or "2013-08-01" 88.

in [12]: ts.balance.resample('d', how='first') out[12]:  date 2013-08-01      87 2013-08-02     900 2013-08-03     nan 2013-08-04     nan 2013-08-05     nan 2013-08-06     910 2013-08-07     nan 2013-08-08     nan 2013-08-09    1000 freq: d, dtype: float64

am missing here? resampling 'first' , 'last' not work way i'm expecting to?

to able resample data pandas first have sort it. if load data , sort index following thing:

>>> pd.read_csv(data_string, parse_dates=[0], index_col=0).sort_index()             balance date                2013-08-01       87 2013-08-01       88 2013-08-02      900 2013-08-06      910 2013-08-09     1000 2013-08-09      930 2013-08-09      950

which explains why got results got. @jeff explained why order "arbitrary" , according comment solution use mergesort algorithm on data before operations...

>>> df = pd.read_csv(data_string, parse_dates=[0],                      index_col=0).sort_index(kind='mergesort') >>> df.balance.resample('d',how='last') 2013-08-01      88 2013-08-02     900 2013-08-03     nan 2013-08-04     nan 2013-08-05     nan 2013-08-06     910 2013-08-07     nan 2013-08-08     nan 2013-08-09    1000 >>> df.balance.resample('d', how='first') 2013-08-01     87 2013-08-02    900 2013-08-03    nan 2013-08-04    nan 2013-08-05    nan 2013-08-06    910 2013-08-07    nan 2013-08-08    nan 2013-08-09    930

Search This Blog

Share

Python pandas timeseries resample giving unexpected results -

Comments

Post a Comment

Popular posts from this blog

Line ending issue with Mercurial or Visual Studio -

R - Plot: How to format in 10-base scientific notation and put it text, mtex, title etc functions? -

validation - How to pass paramaters like unix into windows batch file -