Pandas Basics

Import Pandas

The usual convention is to name Pandas "pd."

import pandas as pd

What is a series?

A series (often called "s" in notebooks) is a one-dimensional data structure like a list.

You can create a series by passing a regular Python list to pd.Series()

budget = [75, 90, 90, 95, 50]

budget
[75, 90, 90, 95, 50]
s = pd.Series(budget)

s
0    75
1    90
2    90
3    95
4    50
dtype: int64
s.describe()
count     5.000000
mean     80.000000
std      18.371173
min      50.000000
25%      75.000000
50%      90.000000
75%      90.000000
max      95.000000
dtype: float64
s.value_counts()
90    2
95    1
75    1
50    1
dtype: int64
s.sort_values(ascending=False)
3    95
2    90
1    90
0    75
4    50
dtype: int64

What is a data frame?

A data frame is a two-dimensional data structure, similar to an Excel table. Two-dimensional means it has both an X and Y axis.

You can create a data frame by passing a dictionary to pd.DataFrame. The column headers shoudl be the key and the column data should be the value (a list). You can also import a CSV with the pd.read_csv() function, an example starts off the Airbnb analysis notebook.

budget_dictionary = {
    'month': ['Jun', 'Jul', 'Aug', 'Sep', 'Oct'],
    'budget': budget,
    'cookie_budget': [5, 5, 5, 5, 3]
}

budget_dictionary
{'month': ['Jun', 'Jul', 'Aug', 'Sep', 'Oct'],
 'budget': [75, 90, 90, 95, 50],
 'cookie_budget': [5, 5, 5, 5, 3]}
df = pd.DataFrame(budget_dictionary)

df
month budget cookie_budget
0 Jun 75 5
1 Jul 90 5
2 Aug 90 5
3 Sep 95 5
4 Oct 50 3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
month            5 non-null object
budget           5 non-null int64
cookie_budget    5 non-null int64
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes

Extracting columns

You can pull out the column of a data frame as a series with this syntax:

df['column_name']

or this syntax

df.column_name
cookie_budget = df['cookie_budget']

type(cookie_budget)
pandas.core.series.Series
cookie_budget.value_counts()
5    4
3    1
Name: cookie_budget, dtype: int64

Indexing with booleans

You can use a conditional to get a series of booleans, and you can then use the series of booleans to extract data that matches the conditional.

about_80_booleans = df['budget'] > 80
 
about_80_booleans
0    False
1     True
2     True
3     True
4    False
Name: budget, dtype: bool
df[about_80_booleans]['month']
1    Jul
2    Aug
3    Sep
Name: month, dtype: object

Get the second row in the data frame as a series using iloc

df.iloc[2]
month            Aug
budget            90
cookie_budget      5
Name: 2, dtype: object

Write our data frame to a CSV file

df.to_csv('cookie_budget.csv')

Create a pie chart showing which months had a cookie budget of \$3 or \$5

cookie_counts = cookie_budget.value_counts()

df['cookie_budget'].value_counts().plot(kind='pie', labels=['$5', '$3'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f651c8fe940>