What is pandas?
pandas is a Python library for data analysis.
Setting up pandas
Install pandas with the following command:
pip install pandas
Then import pandas with the following code:
import pandas as pd
Creating a DataFrame
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
Selecting columns
A single column can be selected as follows; the result is a Series:
df["Age"]
Multiple columns can also be selected by passing a list of column names; the result is a DataFrame:
df[["Name", "Age"]]
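As a minimal sketch of the difference between the two forms, the snippet below uses a shortened, illustrative copy of the DataFrame created earlier. Single brackets with one name return a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Illustrative data, shortened from the DataFrame created earlier
df = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"],
        "Age": [22, 35],
        "Sex": ["male", "male"],
    }
)

ages = df["Age"]                # one column name -> Series
name_age = df[["Name", "Age"]]  # list of column names -> DataFrame

print(type(ages).__name__)      # Series
print(type(name_age).__name__)  # DataFrame
print(name_age.shape)           # (2, 2)
```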
Statistics
Statistics can be computed for a single column or for the whole DataFrame:
df["Age"].max()
df.describe()
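A small sketch of what these two calls return, using a reduced copy of the example data. By default, describe() summarizes only the numeric columns of a mixed DataFrame, so the Sex column does not appear in its output.

```python
import pandas as pd

# Reduced copy of the example data
df = pd.DataFrame({"Age": [22, 35, 58], "Sex": ["male", "male", "female"]})

oldest = df["Age"].max()  # largest value in the Age column
summary = df.describe()   # count, mean, std, min, quartiles and max per numeric column

print(oldest)                       # 58
print(list(summary.columns))        # ['Age']
print(summary.loc["count", "Age"])  # 3.0
```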
Reading a CSV file
titanic = pd.read_csv("data/titanic.csv")
To view the first rows of the table, run the following (here, the first 8 rows):
titanic.head(8)
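To try read_csv and head without the data file, the sketch below reads CSV text from memory instead. The CSV content is made up for illustration and only stands in for data/titanic.csv; read_csv accepts file-like objects as well as paths.

```python
import io

import pandas as pd

# Made-up CSV content standing in for data/titanic.csv
csv_text = """PassengerId,Name,Age
1,Passenger A,22
2,Passenger B,38
3,Passenger C,26
4,Passenger D,35
"""

sample = pd.read_csv(io.StringIO(csv_text))  # read from an in-memory text buffer
first_two = sample.head(2)                   # first 2 rows; head() without an argument returns 5
last_one = sample.tail(1)                    # tail() works the same way from the end

print(first_two)
print(len(first_two))  # 2
```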
Data types
The dtypes attribute shows the data type of each column:
titanic.dtypes
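As a hedged illustration, the frame below reuses Titanic-style column names with made-up values. Besides printing dtypes, it shows how columns can be checked or selected by dtype family.

```python
import pandas as pd

# Column names mirror the Titanic data; the values are made up
sample = pd.DataFrame({"Age": [22, 35], "Fare": [7.25, 53.1], "Name": ["A", "B"]})

print(sample.dtypes)  # one dtype per column

# Helper functions in pandas.api.types test for dtype families
age_is_int = pd.api.types.is_integer_dtype(sample["Age"])
fare_is_float = pd.api.types.is_float_dtype(sample["Fare"])

# select_dtypes keeps only the columns of the requested kind
numeric_columns = list(sample.select_dtypes(include="number").columns)

print(age_is_int, fare_is_float)  # True True
print(numeric_columns)            # ['Age', 'Fare']
```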
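Filtering rows by condition
Specific rows can be selected with a boolean condition, by label with loc, or by position with iloc:
# Passengers older than 35
above_35 = titanic[titanic["Age"] > 35]
# Passengers whose Pclass is 2 or 3
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
# Passengers whose age is known
age_no_na = titanic[titanic["Age"].notna()]
# Names of passengers older than 35 (row condition, then column label)
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
# Rows 10 to 25 and columns 3 to 5, selected by position
titanic.iloc[9:25, 2:5]
# Overwrite the first three values of the fourth column
titanic.iloc[0:3, 3] = "anonymous"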
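The same selections can be tried on a tiny frame. This is a sketch with hypothetical passengers, where None marks a missing age; note that a missing value compares as False, so it is dropped by the Age > 35 filter.

```python
import pandas as pd

# Hypothetical passengers; None marks a missing age
people = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Age": [22.0, 40.0, None, 58.0],
        "Pclass": [3, 1, 2, 1],
    }
)

older = people[people["Age"] > 35]                    # rows B and D
second_third = people[people["Pclass"].isin([2, 3])]  # rows A and C
known_age = people[people["Age"].notna()]             # rows A, B and D
older_names = people.loc[people["Age"] > 35, "Name"]  # row condition plus column label

# iloc selects by position; assigning through it changes the frame itself
people.iloc[0:2, 0] = "anonymous"

print(list(older_names))     # ['B', 'D']
print(list(people["Name"]))  # ['anonymous', 'anonymous', 'C', 'D']
```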
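Plotting
The examples below assume that matplotlib is installed and that air_quality is an already loaded DataFrame with the numeric columns station_antwerp, station_paris and station_london.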
air_quality.plot()
air_quality["station_paris"].plot()
air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
air_quality.plot.box()
air_quality.plot.area(figsize=(12, 4), subplots=True)
Creating and renaming columns
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
air_quality["ratio_paris_antwerp"] = (
air_quality["station_paris"] / air_quality["station_antwerp"]
)
air_quality_renamed = air_quality.rename(
columns={
"station_antwerp": "BETR801",
"station_paris": "FR04014",
"station_london": "London Westminster",
}
)
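A minimal sketch of both operations on made-up readings (the column names follow the examples above, the values are invented). Column arithmetic is applied element-wise, and rename returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Made-up readings; only the column names come from the examples above
air = pd.DataFrame(
    {
        "station_antwerp": [20.0, 40.0],
        "station_paris": [10.0, 30.0],
        "station_london": [5.0, 10.0],
    }
)

# Arithmetic between columns works row by row, no loop needed
air["ratio_paris_antwerp"] = air["station_paris"] / air["station_antwerp"]

# rename returns a new DataFrame; air keeps its original column names
renamed = air.rename(columns={"station_paris": "FR04014"})

print(list(air["ratio_paris_antwerp"]))  # [0.5, 0.75]
print("FR04014" in renamed.columns)      # True
print("station_paris" in air.columns)    # True
```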
Summary statistics
titanic["Age"].mean()
titanic[["Age", "Fare"]].median()
titanic[["Age", "Fare"]].describe()
titanic.agg(
{
"Age": ["min", "max", "median", "skew"],
"Fare": ["min", "max", "median", "mean"],
}
)
titanic[["Sex", "Age"]].groupby("Sex").mean()
titanic.groupby("Sex").mean(numeric_only=True)
titanic.groupby("Sex")["Age"].mean()
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
titanic["Pclass"].value_counts()
titanic.groupby("Pclass")["Pclass"].count()
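To make the grouping step concrete, here is a small sketch on hypothetical passengers. groupby splits the rows by key and applies the aggregation per group; value_counts and a grouped count give the same numbers, but value_counts orders them by frequency.

```python
import pandas as pd

# Hypothetical passengers
people = pd.DataFrame(
    {
        "Sex": ["male", "female", "male", "female"],
        "Pclass": [3, 1, 1, 1],
        "Age": [20.0, 30.0, 40.0, 50.0],
    }
)

mean_age = people.groupby("Sex")["Age"].mean()            # average Age per Sex
class_counts = people["Pclass"].value_counts()            # rows per class, most frequent first
grouped_counts = people.groupby("Pclass")["Pclass"].count()  # same counts, ordered by class

print(mean_age.loc["female"])  # 40.0
print(mean_age.loc["male"])    # 30.0
print(class_counts.loc[1])     # 3
```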
Reshaping the table layout
titanic.sort_values(by="Age").head()
titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
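A short sketch of the sorting behavior on hypothetical rows. A single ascending=False applies to every key in by; passing a list (an addition here, not used in the lines above) sets the direction per key.

```python
import pandas as pd

# Hypothetical rows
people = pd.DataFrame({"Pclass": [1, 3, 3, 2], "Age": [40.0, 22.0, 35.0, 30.0]})

by_age = people.sort_values(by="Age")  # ascending by default
both_desc = people.sort_values(by=["Pclass", "Age"], ascending=False)
mixed = people.sort_values(by=["Pclass", "Age"], ascending=[True, False])

print(list(by_age["Age"]))     # [22.0, 30.0, 35.0, 40.0]
print(list(both_desc["Age"]))  # [35.0, 22.0, 30.0, 40.0]
print(list(mixed["Age"]))      # [40.0, 30.0, 35.0, 22.0]
```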
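# From here on, air_quality is assumed to be in long format: one row per measurement,
# with location, parameter and value columns and date.utc as the index
no2 = air_quality[air_quality["parameter"] == "no2"]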
no2_subset = no2.sort_index().groupby(["location"]).head(2)
no2_subset.pivot(columns="location", values="value")
no2.pivot(columns="location", values="value").plot()
air_quality.pivot_table(
values="value", index="location", columns="parameter", aggfunc="mean"
)
air_quality.pivot_table(
values="value",
index="location",
columns="parameter",
aggfunc="mean",
margins=True,
)
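The following sketch shows what pivot_table computes, using invented measurements in the same long format. Unlike pivot, it aggregates when several rows share the same index and column combination, and margins=True adds an "All" row and column.

```python
import pandas as pd

# Invented measurements in long format
long_df = pd.DataFrame(
    {
        "location": ["FR04014", "FR04014", "BETR801", "BETR801", "FR04014"],
        "parameter": ["no2", "pm25", "no2", "pm25", "no2"],
        "value": [10.0, 20.0, 30.0, 40.0, 30.0],
    }
)

table = long_df.pivot_table(
    values="value", index="location", columns="parameter", aggfunc="mean"
)
with_totals = long_df.pivot_table(
    values="value", index="location", columns="parameter", aggfunc="mean", margins=True
)

print(table.loc["FR04014", "no2"])    # 20.0, the mean of 10.0 and 30.0
print(table.loc["BETR801", "pm25"])   # 40.0
print(with_totals.loc["All", "All"])  # 26.0, the mean of all five values
```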
no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
no_2 = no2_pivoted.melt(id_vars="date.utc")
no_2 = no2_pivoted.melt(
id_vars="date.utc",
value_vars=["BETR801", "FR04014", "London Westminster"],
value_name="NO_2",
var_name="id_location",
)
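pivot and melt are inverse operations, which the round trip below sketches on invented values (the dates and readings are placeholders). pivot spreads one column's values into separate columns, and melt gathers them back into rows.

```python
import pandas as pd

# Invented long-format data with date.utc as the index
long_df = pd.DataFrame(
    {
        "date.utc": ["2019-05-07", "2019-05-07", "2019-05-08", "2019-05-08"],
        "location": ["FR04014", "BETR801", "FR04014", "BETR801"],
        "value": [10.0, 20.0, 30.0, 40.0],
    }
).set_index("date.utc")

# Long to wide: one column per location, the existing index is kept
wide = long_df.pivot(columns="location", values="value")

# Wide to long: date.utc stays as an identifier, the station columns become rows
back = wide.reset_index().melt(
    id_vars="date.utc", value_name="NO_2", var_name="id_location"
)

print(wide.shape)          # (2, 2)
print(list(back.columns))  # ['date.utc', 'id_location', 'NO_2']
print(back.shape)          # (4, 3)
```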
Combining tables
# Load the NO2 and PM2.5 measurements and keep the same four columns in each
air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv",
                              parse_dates=True)
air_quality_no2 = air_quality_no2[["date.utc", "location",
                                   "parameter", "value"]]
air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv",
                               parse_dates=True)
air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                     "parameter", "value"]]
# Stack the rows of both tables, then sort by timestamp
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)
air_quality = air_quality.sort_values("date.utc")
# keys adds an outer index level recording which table each row came from
air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=["PM25", "NO2"])
# Left join: add station coordinates on the shared location column
stations_coord = pd.read_csv("data/air_quality_stations.csv")
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
# Left join where the key columns have different names in the two tables
air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")
air_quality = pd.merge(air_quality, air_quality_parameters,
                       how='left', left_on='parameter', right_on='id')
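The effect of concat and a left merge can be seen on a few invented rows. In this sketch the coordinate value is hypothetical, and ignore_index=True is an addition that renumbers the rows after stacking. A left merge keeps every row of the left table and fills in missing values where the right table has no match.

```python
import pandas as pd

# Invented measurements
pm25 = pd.DataFrame(
    {"location": ["BETR801", "FR04014"], "parameter": ["pm25", "pm25"], "value": [18.0, 12.0]}
)
no2 = pd.DataFrame({"location": ["FR04014"], "parameter": ["no2"], "value": [25.0]})

# axis=0 stacks the rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([pm25, no2], axis=0, ignore_index=True)

# Hypothetical coordinates, available for one station only
coords = pd.DataFrame({"location": ["FR04014"], "latitude": [48.8]})

# Left join: all rows of stacked are kept, unmatched rows get NaN
merged = pd.merge(stacked, coords, how="left", on="location")

print(stacked.shape)                    # (3, 3)
print(merged.shape)                     # (3, 4)
print(merged["latitude"].isna().sum())  # 1
```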