# coding: utf-8

# # Pandas
#
# [Pandas](http://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of _tabular_ data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.
#
# A [recent analysis](https://stackoverflow.blog/2017/09/06/incredible-growth-python/) of questions from Stack Overflow showed that Python is the fastest growing and most widely used programming language in the world (in developed countries).
#
# ![python growth](https://zgab33vy595fw5zq-zippykid.netdna-ssl.com/wp-content/uploads/2017/09/growth_major_languages-1-1024x878.png)
#
# A [follow-up analysis](https://stackoverflow.blog/2017/09/14/python-growing-quickly/) showed that this growth is driven by data science packages such as numpy, matplotlib, and especially pandas.
#
# ![pandas growth](https://zgab33vy595fw5zq-zippykid.netdna-ssl.com/wp-content/uploads/2017/09/related_tags_over_time-1-1024x1024.png)
#
# The exponential growth of pandas is due to the fact that it _just works_. It saves you time and helps you do science more efficiently and effectively.
# ### Pandas capabilities (from the Pandas website):
#
# * A fast and efficient DataFrame object for data manipulation with integrated indexing;
# * Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
# * Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
# * Flexible reshaping and pivoting of data sets;
# * Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
# * Columns can be inserted and deleted from data structures for size mutability;
# * Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
# * High performance merging and joining of data sets;
# * Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
# * Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
# * Highly optimized for performance, with critical code paths written in Cython or C;
# * Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
#
# In this lecture, we will go over the basic capabilities of Pandas. It is a very deep library, and you will need to dig into the [documentation](http://pandas.pydata.org/pandas-docs/stable/) for more advanced usage.
#
# Pandas was created by [Wes McKinney](http://wesmckinney.com/).
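# The "split-apply-combine" group by engine mentioned in the list above can be previewed in a few lines. This is a toy sketch with made-up data, not part of the original lecture:

```python
import pandas as pd

# split the rows by the 'city' label, apply a mean to each group,
# then combine the per-group results into one Series
df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'LA'],
                   'temp': [10.0, 20.0, 14.0, 22.0]})
mean_by_city = df.groupby('city')['temp'].mean()
```

# We will return to groupby in much more depth at the end of the lecture.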
# Many of the examples here are drawn from Wes McKinney's book [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do), which includes a github repo of [code samples](https://github.com/wesm/pydata-book).

# In[1]:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# ## Pandas Data Structures: Series
#
# A Series represents a one-dimensional array of data. The main difference between a Series and a numpy array is that a Series has an _index_. The index contains the labels that we use to access the data.
#
# There are many ways to [create a Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series). We will just show a few.

# In[2]:

names = ['Ryan', 'Chiara', 'Johnny']
values = [35, 36, 1.8]
ages = pd.Series(values, index=names)
ages

# Series have built-in plotting methods.

# In[3]:

ages.plot(kind='bar')

# Arithmetic operations and most numpy functions can be applied to Series.
# An important point is that Series keep their index during such operations.

# In[4]:

np.log(ages) / ages**2

# We can access the underlying index object if we need to:

# In[5]:

ages.index

# We can get values back out using the index via the `.loc` attribute

# In[6]:

ages.loc['Johnny']

# Or by raw position using `.iloc`

# In[7]:

ages.iloc[2]

# If we need to, we can always get the raw data back out as well

# In[8]:

ages.values

# In[9]:

ages.index

# ## Pandas Data Structures: DataFrame
#
# There is a lot more to Series, but they are limited to a single "column". A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of Series that share the same index. It's a lot like a table in a spreadsheet.
#
# Below we create a DataFrame.
# In[10]:

# first we create a dictionary
data = {'age': [35, 36, 1.8],
        'height': [180, 155, 83],
        'weight': [72.5, np.nan, 11.3]}
df = pd.DataFrame(data, index=['Ryan', 'Chiara', 'Johnny'])
df

# Pandas handles missing data very elegantly, keeping track of it through all calculations.

# In[11]:

df.info()

# A wide range of statistical functions are available on both Series and DataFrames.

# In[12]:

df.min()

# In[13]:

df.mean()

# In[14]:

df.std()

# In[15]:

df.describe()

# We can get a single column as a Series using python's getitem syntax on the DataFrame object.

# In[16]:

df['height']

# ...or using attribute syntax.

# In[17]:

df.height

# New columns can easily be added to DataFrames

# In[18]:

df['density'] = df.weight / df.height
df

# ## Merging Data
#
# Pandas supports a wide range of methods for merging different datasets. These are described extensively in the [documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html). Here we just give a few examples.

# In[19]:

education = pd.Series(['PhD', 'PhD', None, 'masters'],
                      index=['Ryan', 'Chiara', 'Johnny', 'Takaya'],
                      name='education')
# returns a new DataFrame
df.join(education)

# In[20]:

# returns a new DataFrame
df.join(education, how='right')

# In[21]:

# returns a new DataFrame
df.reindex(['Ryan', 'Chiara', 'Johnny', 'Takaya', 'Kerry'])

# We can also index using a boolean series. This is very useful.

# In[22]:

adults = df[df.age > 18]
adults

# In[23]:

df['is_adult'] = df.age > 18
df

# ## Plotting
#
# DataFrames have all kinds of [useful plotting](https://pandas.pydata.org/pandas-docs/stable/visualization.html) built in.

# In[24]:

df.plot(kind='scatter', x='age', y='height', grid=True)

# In[25]:

df.plot(kind='bar')

# ## Time Indexes
#
# Indexes are very powerful. They are a big part of why Pandas is so useful. There are different indices for different types of data. Time Indexes are especially great!
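# One more note on label alignment before we turn to time indexes: arithmetic between two differently-indexed objects matches values by label, not by position. A toy sketch with made-up numbers (not from the lecture):

```python
import pandas as pd

heights = pd.Series([180, 155, 83], index=['Ryan', 'Chiara', 'Johnny'])
# a second Series with a shorter, differently ordered index
growth = pd.Series([1.0, 2.0], index=['Johnny', 'Ryan'])

# values are aligned by label; labels present in only one Series come out as NaN
total = heights + growth
```

# This automatic alignment is what the "intelligent data alignment" bullet at the top of the lecture refers to.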
# In[26]:

two_years = pd.date_range(start='2014-01-01', end='2016-01-01', freq='D')
timeseries = pd.Series(np.sin(2 * np.pi * two_years.dayofyear / 365),
                       index=two_years)
timeseries.plot()

# We can use python's slicing notation inside `.loc` to select a date range.

# In[27]:

timeseries.loc['2015-01-01':'2015-07-01'].plot()

# ## Stock Market Data
#
# * oops - Ryan's links don't work, so we will use static files
#
# Now we read some stock market data from Google finance. I have created direct links to Google and Apple stock price data.

# In[28]:

#!curl -L -o goog.csv http://tinyurl.com/rces-goog
#!curl -L -o aapl.csv http://tinyurl.com/rces-aapl-csv
get_ipython().system(' cp /home/pangeo/notebooks/GOOGL.csv goog.csv')
get_ipython().system(' cp /home/pangeo/notebooks/AAPL.csv aapl.csv')

# In[29]:

get_ipython().system(' head goog.csv')

# We can see that this is well-formatted, tidy CSV data, ready for immediate ingestion into Pandas.
# We use Pandas' amazing [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to do this.

# In[30]:

goog = pd.read_csv('goog.csv')
goog.head()

# Not bad! But we can do better by giving read_csv some hints.

# In[31]:

goog = pd.read_csv('goog.csv', parse_dates=[0], index_col=0)
goog.head()

# In[32]:

goog.info()

# In[33]:

goog.Close.plot()

# In[34]:

aapl = pd.read_csv('aapl.csv', parse_dates=[0], index_col=0)
aapl.info()

# In[35]:

aapl_close = aapl.Close.rename('aapl')
goog_close = goog.Close.rename('goog')
stocks = pd.concat([aapl_close, goog_close], axis=1)
stocks.head()

# In[36]:

stocks.plot()

# Pandas knows how to take correlations. And [tons of other computations](https://pandas.pydata.org/pandas-docs/stable/computation.html).

# In[37]:

stocks.corr()

# Because it understands times, it can do really cool stuff like resampling.
# In[38]:

# resample by taking the mean over each month
fig, ax = plt.subplots()
stocks.resample('MS').mean().plot(ax=ax, color=['r', 'b'])
# and each year
stocks.resample('AS').mean().plot(ax=ax, color=['r', 'b'])

# The string `MS` means "month start". The string `AS` means "year start". There is a long list of possible [frequency aliases](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases).
#
# We can also apply other reduction operations with resample. These are described in the [resample docs](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling).

# In[39]:

# get resample object
rs = stocks.goog.resample('MS')
# standard deviation of each month
rs.std().plot()

# ## Temperature Data
#
# We download some timeseries data from the [Berkeley Earth](http://berkeleyearth.org/) surface temperature dataset. This is timeseries data from various locations around earth. Let's get our local temperatures.

# In[40]:

get_ipython().system(' curl -o nyc_temp.txt http://berkeleyearth.lbl.gov/auto/Local/TAVG/Text/40.99N-74.56W-TAVG-Trend.txt')

# If we examine this data, we see it is NOT a well-formatted CSV file. Loading it will be a bit painful, but Pandas makes the job relatively easy.
# In[41]:

get_ipython().system(' head -72 nyc_temp.txt | tail -8')

# In[42]:

# http://berkeleyearth.lbl.gov/locations/40.99N-74.56W
# http://berkeleyearth.lbl.gov/auto/Local/TAVG/Text/40.99N-74.56W-TAVG-Trend.txt
#temp = pd.read_csv('nyc_temp.txt')
col_names = ['year', 'month', 'monthly_anom']
temp = pd.read_csv('nyc_temp.txt',
                   header=None, usecols=[0, 1, 2], names=col_names,
                   delim_whitespace=True, comment='%')
temp.head()

# In[43]:

# to_datetime needs a day column as well as year and month
date_df = temp.drop('monthly_anom', axis=1)
date_df['day'] = 1
date_index = pd.DatetimeIndex(pd.to_datetime(date_df))
temp = temp.set_index(date_index).drop(['year', 'month'], axis=1)
temp.head()

# In[44]:

temp.plot()

# In[45]:

fig, ax = plt.subplots()
temp.plot(ax=ax)
temp.resample('AS').mean().plot(ax=ax)
temp.resample('10AS').mean().plot(ax=ax)

# Pandas can do both time-based resampling and operations over fixed-length rolling windows. These are very similar but distinct; see the [discussion in the Pandas docs](https://pandas.pydata.org/pandas-docs/stable/computation.html#time-aware-rolling-vs-resampling).

# In[46]:

# a more advanced operation on rolling windows
def difference_max_min(data):
    return data.max() - data.min()

rw = temp.rolling('365D')
rw.apply(difference_max_min).plot()

# To create a "climatology" (i.e. the average of all same months), we can use Pandas' [groupby](https://pandas.pydata.org/pandas-docs/stable/groupby.html) functionality.

# In[47]:

# the seasonal cycle has already been removed from these anomalies!
temp.groupby(temp.index.month).mean().plot()

# In[48]:

# find the hottest years
temp.groupby(temp.index.year).mean().sort_values('monthly_anom', ascending=False).head(10)

# ## Groupby
#
# Now we will explore groupby's capabilities more in a public dataset from the City of New York: the [Rat Information Portal](https://data.cityofnewyork.us/Health/Rats/amyk-xiv9)!
# In[49]:

# https://data.cityofnewyork.us/Health/Rats/amyk-xiv9
rats = pd.read_csv('https://data.cityofnewyork.us/api/views/amyk-xiv9/rows.csv',
                   parse_dates=['APPROVED_DATE', 'INSPECTION_DATE'])

# In[50]:

rats.info()

# In[51]:

rats.head()

# Let's do some grouping to explore the data.

# In[52]:

rats.groupby('INSPECTION_TYPE')['INSPECTION_TYPE'].count()

# In[53]:

rats.groupby('BORO_CODE')['BORO_CODE'].count().head()

# In[54]:

rats.groupby('STREET_NAME')['STREET_NAME'].count().head(20)

# This dataset clearly needs some cleaning. We can use Pandas' [text features](https://pandas.pydata.org/pandas-docs/stable/text.html) to strip the whitespace out of the data.

# In[55]:

# clean up street names
street_names_cleaned = rats.STREET_NAME.str.strip()
street_names_cleaned.groupby(street_names_cleaned).count().head(20)

# In[56]:

count = street_names_cleaned.groupby(street_names_cleaned).count()
count.sort_values(ascending=False).head(20)

# To get a better idea of the geography, let's plot the locations of the inspections. But first let's look at the statistics.

# In[57]:

rats[['LATITUDE', 'LONGITUDE']].describe()

# There are clearly some weird outliers in the location data. We need to strip these out before plotting.
# In[58]:

valid_latlon = rats[(rats.LATITUDE > 30) & (rats.LONGITUDE < -70)]
valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', C='BORO_CODE', cmap='Set1')

# In[59]:

# https://github.com/pandas-dev/pandas/issues/10678
valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False)

# In[60]:

valid_latlon.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False, bins='log', cmap='magma')

# In[61]:

manhattan_rats = valid_latlon[valid_latlon.BORO_CODE == 1]
manhattan_rats.plot.hexbin('LONGITUDE', 'LATITUDE', sharex=False, bins='log', cmap='magma')

# In[62]:

inspection_date = pd.DatetimeIndex(rats.INSPECTION_DATE)
fig, ax = plt.subplots()
rats.groupby(inspection_date.weekday)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('weekday');

# In[63]:

fig, ax = plt.subplots()
rats.groupby(inspection_date.hour)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('hour');

# In[64]:

fig, ax = plt.subplots()
rats.groupby(inspection_date.month)['JOB_ID'].count().plot(kind='bar', ax=ax)
ax.set_xlabel('month');
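# As an aside, the `groupby(col)[col].count()` pattern used above can also be written with the more concise `value_counts()` method, which additionally sorts the counts in descending order. A small sketch on made-up data (not the rats dataset):

```python
import pandas as pd

streets = pd.Series(['BROADWAY', 'GRAND ST', 'BROADWAY',
                     'BROADWAY', 'GRAND ST'])

# counts per unique value, sorted descending -- the same numbers as
# streets.groupby(streets).count(), plus the sorting
counts = streets.value_counts()
```

# Which form to use is a matter of taste; groupby generalizes to arbitrary aggregations, while value_counts is the idiomatic shortcut for plain counting.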