# coding: utf-8

# # Pandas
#
# [Pandas](http://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of _tabular_ data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.
#
# A [recent analysis](https://stackoverflow.blog/2017/09/06/incredible-growth-python/) of questions from Stack Overflow showed that Python is the fastest growing and most widely used programming language in the world (in developed countries).
#
# ![python growth](https://zgab33vy595fw5zq-zippykid.netdna-ssl.com/wp-content/uploads/2017/09/growth_major_languages-1-1024x878.png)
#
# A [follow-up analysis](https://stackoverflow.blog/2017/09/14/python-growing-quickly/) showed that this growth is driven by data science packages such as numpy, matplotlib, and especially pandas.
#
# ![pandas growth](https://zgab33vy595fw5zq-zippykid.netdna-ssl.com/wp-content/uploads/2017/09/related_tags_over_time-1-1024x1024.png)
#
# The exponential growth of pandas is due to the fact that it _just works_. It saves you time and helps you do science more efficiently and effectively.
# ### Pandas capabilities (from the Pandas website):
#
# * A fast and efficient DataFrame object for data manipulation with integrated indexing;
# * Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
# * Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
# * Flexible reshaping and pivoting of data sets;
# * Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
# * Columns can be inserted and deleted from data structures for size mutability;
# * Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
# * High performance merging and joining of data sets;
# * Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
# * Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
# * Highly optimized for performance, with critical code paths written in Cython or C;
# * Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
#
# In this lecture, we will go over the basic capabilities of Pandas. It is a very deep library, and you will need to dig into the [documentation](http://pandas.pydata.org/pandas-docs/stable/) for more advanced usage.
#
# Pandas was created by [Wes McKinney](http://wesmckinney.com/).
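# The "split-apply-combine" group by engine mentioned in the list above can be previewed in a few lines. This is a toy sketch with made-up data, not part of the original lecture:

```python
import pandas as pd

# split the rows by the 'city' label, apply a mean to each group,
# then combine the per-group results into one Series
df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'LA'],
                   'temp': [10.0, 20.0, 14.0, 22.0]})
mean_by_city = df.groupby('city')['temp'].mean()
```

# We will return to groupby in much more depth at the end of the lecture.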
# Many of the examples here are drawn from Wes McKinney's book [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do), which includes a github repo of [code samples](https://github.com/wesm/pydata-book).

# In[1]:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# ## Pandas Data Structures: Series
#
# A Series represents a one-dimensional array of data. The main difference between a Series and a numpy array is that a Series has an _index_. The index contains the labels that we use to access the data.
#
# There are many ways to [create a Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series). We will just show a few.

# In[2]:

names = ['Ryan', 'Chiara', 'Johnny']
values = [35, 36, 1.8]
ages = pd.Series(values, index=names)
ages

# Series have built-in plotting methods.

# In[3]:

ages.plot(kind='bar')

# Arithmetic operations and most numpy functions can be applied to Series.
# An important point is that Series keep their index during such operations.

# In[4]:

np.log(ages) / ages**2

# We can access the underlying index object if we need to:

# In[5]:

ages.index

# We can get values back out using the index via the `.loc` attribute

# In[6]:

ages.loc['Johnny']

# Or by raw position using `.iloc`

# In[7]:

ages.iloc[2]

# If we need to, we can always get the raw data back out as well

# In[8]:

ages.values

# In[9]:

ages.index

# ## Pandas Data Structures: DataFrame
#
# There is a lot more to Series, but they are limited to a single "column". A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of Series that share the same index. It's a lot like a table in a spreadsheet.
#
# Below we create a DataFrame.
# In[10]:

# first we create a dictionary
data = {'age': [35, 36, 1.8],
        'height': [180, 155, 83],
        'weight': [72.5, np.nan, 11.3]}
df = pd.DataFrame(data, index=['Ryan', 'Chiara', 'Johnny'])
df

# Pandas handles missing data very elegantly, keeping track of it through all calculations.

# In[11]:

df.info()

# A wide range of statistical functions are available on both Series and DataFrames.

# In[12]:

df.min()

# In[13]:

df.mean()

# In[14]:

df.std()

# In[15]:

df.describe()

# We can get a single column as a Series using python's getitem syntax on the DataFrame object.

# In[16]:

df['height']

# ...or using attribute syntax.

# In[17]:

df.height

# New columns can easily be added to DataFrames

# In[18]:

df['density'] = df.weight / df.height
df

# ## Merging Data
#
# Pandas supports a wide range of methods for merging different datasets. These are described extensively in the [documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html). Here we just give a few examples.

# In[19]:

education = pd.Series(['PhD', 'PhD', None, 'masters'],
                      index=['Ryan', 'Chiara', 'Johnny', 'Takaya'],
                      name='education')
# returns a new DataFrame
df.join(education)

# In[20]:

# returns a new DataFrame
df.join(education, how='right')

# In[21]:

# returns a new DataFrame
df.reindex(['Ryan', 'Chiara', 'Johnny', 'Takaya', 'Kerry'])

# We can also index using a boolean series. This is very useful.

# In[22]:

adults = df[df.age > 18]
adults

# In[23]:

df['is_adult'] = df.age > 18
df

# ## Plotting
#
# DataFrames have all kinds of [useful plotting](https://pandas.pydata.org/pandas-docs/stable/visualization.html) built in.

# In[24]:

df.plot(kind='scatter', x='age', y='height', grid=True)

# In[25]:

df.plot(kind='bar')

# ## Time Indexes
#
# Indexes are very powerful. They are a big part of why Pandas is so useful. There are different indices for different types of data. Time Indexes are especially great!
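# One more note on label alignment before we turn to time indexes: arithmetic between two differently-indexed objects matches values by label, not by position. A toy sketch with made-up numbers (not from the lecture):

```python
import pandas as pd

heights = pd.Series([180, 155, 83], index=['Ryan', 'Chiara', 'Johnny'])
# a second Series with a shorter, differently ordered index
growth = pd.Series([1.0, 2.0], index=['Johnny', 'Ryan'])

# values are aligned by label; labels present in only one Series come out as NaN
total = heights + growth
```

# This automatic alignment is what the "intelligent data alignment" bullet at the top of the lecture refers to.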
# In[26]:

two_years = pd.date_range(start='2014-01-01', end='2016-01-01', freq='D')
timeseries = pd.Series(np.sin(2 * np.pi * two_years.dayofyear / 365),
                       index=two_years)
timeseries.plot()

# We can use python's slicing notation inside `.loc` to select a date range.

# In[27]:

timeseries.loc['2015-01-01':'2015-07-01'].plot()

# ## Stock Market Data
#
# * oops - Ryan's links don't work, so we will use static files
#
# Now we read some stock market data from Google finance. I have created direct links to Google and Apple stock price data.

# In[28]:

#!curl -L -o goog.csv http://tinyurl.com/rces-goog
#!curl -L -o aapl.csv http://tinyurl.com/rces-aapl-csv
get_ipython().system(' cp /home/pangeo/notebooks/GOOGL.csv goog.csv')
get_ipython().system(' cp /home/pangeo/notebooks/AAPL.csv aapl.csv')

# In[29]:

get_ipython().system(' head goog.csv')

# We can see that this is well-formatted, tidy CSV data, ready for immediate ingestion into Pandas.
# We use Pandas' amazing [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to do this.

# In[30]:

goog = pd.read_csv('goog.csv')
goog.head()

# Not bad! But we can do better by giving read_csv some hints.

# In[31]:

goog = pd.read_csv('goog.csv', parse_dates=[0], index_col=0)
goog.head()

# In[32]:

goog.info()

# In[33]:

goog.Close.plot()

# In[34]:

aapl = pd.read_csv('aapl.csv', parse_dates=[0], index_col=0)
aapl.info()

# In[35]:

aapl_close = aapl.Close.rename('aapl')
goog_close = goog.Close.rename('goog')
stocks = pd.concat([aapl_close, goog_close], axis=1)
stocks.head()

# In[36]:

stocks.plot()

# Pandas knows how to take correlations. And [tons of other computations](https://pandas.pydata.org/pandas-docs/stable/computation.html).

# In[37]:

stocks.corr()

# Because it understands times, it can do really cool stuff like resampling.
# In[38]:

# resample by taking the mean over each month
fig, ax = plt.subplots()
stocks.resample('MS').mean().plot(ax=ax, color=['r', 'b'])
# and each year
stocks.resample('AS').mean().plot(ax=ax, color=['r', 'b'])

# The string `MS` means "month start". The string `AS` means "year start". There is a long list of possible [frequency aliases](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases).
#
# We can also apply other reduction operations with resample. These are described in the [resample docs](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling).

# In[39]:

# get resample object
rs = stocks.goog.resample('MS')
# standard deviation of each month
rs.std().plot()

# ## Temperature Data
#
# We download some timeseries data from the [Berkeley Earth](http://berkeleyearth.org/) surface temperature dataset. This is timeseries data from various locations around earth. Let's get our local temperatures.

# In[40]:

get_ipython().system(' curl -o nyc_temp.txt http://berkeleyearth.lbl.gov/auto/Local/TAVG/Text/40.99N-74.56W-TAVG-Trend.txt')

# If we examine this data, we see it is NOT a well-formatted CSV file. Loading it will be a bit painful, but Pandas makes the job relatively easy.
# In[41]:

get_ipython().system(' head -72 nyc_temp.txt | tail -8')

# In[42]:

# http://berkeleyearth.lbl.gov/locations/40.99N-74.56W
# http://berkeleyearth.lbl.gov/auto/Local/TAVG/Text/40.99N-74.56W-TAVG-Trend.txt
#temp = pd.read_csv('nyc_temp.txt')
col_names = ['year', 'month', 'monthly_anom']
temp = pd.read_csv('nyc_temp.txt',
                   header=None, usecols=[0, 1, 2], names=col_names,
                   delim_whitespace=True, comment='%')
temp.head()

# In[43]:

# to_datetime needs a day column as well as year and month
date_df = temp.drop('monthly_anom', axis=1)
date_df['day'] = 1
date_index = pd.DatetimeIndex(pd.to_datetime(date_df))
temp = temp.set_index(date_index).drop(['year', 'month'], axis=1)
temp.head()

# In[44]:

temp.plot()

# In[45]:

fig, ax = plt.subplots()
temp.plot(ax=ax)
temp.resample('AS').mean().plot(ax=ax)
temp.resample('10AS').mean().plot(ax=ax)

# Pandas can do both time-based resampling and operations over fixed-length rolling windows. These are very similar but distinct; see the [discussion in the Pandas docs](https://pandas.pydata.org/pandas-docs/stable/computation.html#time-aware-rolling-vs-resampling).

# In[46]:

# a more advanced operation on rolling windows
def difference_max_min(data):
    return data.max() - data.min()

rw = temp.rolling('365D')
rw.apply(difference_max_min).plot()

# To create a "climatology" (i.e. the average of all same months), we can use Pandas' [groupby](https://pandas.pydata.org/pandas-docs/stable/groupby.html) functionality.

# In[47]:

# the seasonal cycle has already been removed from these anomalies!
temp.groupby(temp.index.month).mean().plot()

# In[48]:

# find the hottest years
temp.groupby(temp.index.year).mean().sort_values('monthly_anom', ascending=False).head(10)

# ## Groupby
#
# Now we will explore groupby's capabilities more in a public dataset from the City of New York: the [Rat Information Portal](https://data.cityofnewyork.us/Health/Rats/amyk-xiv9)!
# In[49]:

# https://data.cityofnewyork.us/Health/Rats/amyk-xiv9
rats = pd.read_csv('https://data.cityofnewyork.us/api/views/amyk-xiv9/rows.csv',
                   parse_dates=['APPROVED_DATE', 'INSPECTION_DATE'])

# In[50]:

rats.info()

# In[51]:

rats.head()

# Let's do some grouping to explore the data.

# In[52]:

rats.groupby('INSPECTION_TYPE')['INSPECTION_TYPE'].count()

# In[53]:

rats.groupby('BORO_CODE')['BORO_CODE'].count().head()

# In[54]:

rats.groupby('STREET_NAME')['STREET_NAME'].count().head(20)

# This dataset clearly needs some cleaning. We can use Pandas' [text features](https://pandas.pydata.org/pandas-docs/stable/text.html) to strip the whitespace out of the data.

# In[55]:

# clean up street names
street_names_cleaned = rats.STREET_NAME.str.strip()
street_names_cleaned.groupby(street_names_cleaned).count().head(20)

# In[56]:

count = street_names_cleaned.groupby(street_names_cleaned).count()
count.sort_values(ascending=False).head(20)

# To get a better idea of the geography, let's plot the locations of the inspections. But first let's look at the statistics.

# In[57]:

rats[['LATITUDE', 'LONGITUDE']].describe()

# There are clearly some weird outliers in the location data. We need to strip these out before plotting.
# In[58]:

valid_latlon = rats[(rats.LATITUDE > 30) & (rats.LONGITUDE < -70)]
valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', C='BORO_CODE', cmap='Set1')

# In[59]:

# https://github.com/pandas-dev/pandas/issues/10678
valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False)

# In[60]:

valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False, bins='log', cmap='magma')

# In[61]:

manhattan_rats = valid_latlon[valid_latlon.BORO_CODE == 1]
manhattan_rats.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False, bins='log', cmap='magma')

# In[62]:

inspection_date = pd.DatetimeIndex(rats.INSPECTION_DATE)
fig, ax = plt.subplots()
rats.groupby(inspection_date.weekday)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('weekday');

# In[63]:

fig, ax = plt.subplots()
rats.groupby(inspection_date.hour)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('hour');

# In[64]:

fig, ax = plt.subplots()
rats.groupby(inspection_date.month)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('month');
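# As an aside, the `groupby(col)[col].count()` pattern used above can also be written with the more concise `value_counts()` method, which additionally sorts the counts in descending order. A small sketch on made-up data (not the rats dataset):

```python
import pandas as pd

streets = pd.Series(['BROADWAY', 'GRAND ST', 'BROADWAY',
                     'BROADWAY', 'GRAND ST'])

# counts per unique value, sorted descending -- the same numbers as
# streets.groupby(streets).count(), plus the sorting
counts = streets.value_counts()
```

# Which form to use is a matter of taste; groupby generalizes to arbitrary aggregations, while value_counts is the idiomatic shortcut for plain counting.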