Introduction to Pandas [Python Data Science]

What is Pandas?

Pandas is a Python library for data analysis.

Setting up Pandas

Install pandas with the following command:

pip install pandas

Import pandas with the following code:

import pandas as pd

Creating a DataFrame

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
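As a quick check of what the constructor produces, the sketch below rebuilds the same table and inspects its shape and column labels. Each dictionary key becomes a column label and each list supplies that column's values.

```python
import pandas as pd

# Same three-passenger table as above: dict keys become column labels.
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels, in insertion order
```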

Selecting columns

A single column can be selected as follows:

df["Age"]

Multiple columns can also be selected by passing a list of column names:

df[["Name", "Age"]]
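The two forms of selection return different types, which is easy to miss. The sketch below, using a shortened version of the table above, shows that a single label gives a one-dimensional Series while a list of labels gives a two-dimensional DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

ages = df["Age"]              # single label -> Series
subset = df[["Name", "Age"]]  # list of labels -> DataFrame

print(type(ages).__name__, ages.shape)
print(type(subset).__name__, subset.shape)
```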

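Statistics

Statistics such as the maximum of a column, or a summary of all numeric columns, can be computed as follows: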

df["Age"].max()
df.describe()
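To make the two calls concrete, here is a minimal sketch on a small table with the same Age and Sex values as above. Note that describe() summarizes only the numeric columns by default when the table mixes numeric and text columns.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

oldest = df["Age"].max()  # largest value in the Age column
summary = df.describe()   # count, mean, std, min, quartiles, max

print(oldest)
print(summary)
```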

Reading a CSV file

titanic = pd.read_csv("data/titanic.csv")

To view the first rows of the table (here, the first 8), run the following:

titanic.head(8)
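If the CSV file is not at hand, the same calls can be tried on in-memory text. The sketch below assumes a tiny made-up dataset (the names and values are hypothetical, not taken from data/titanic.csv) and reads it through io.StringIO.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
csv_text = """PassengerId,Name,Age
1,Passenger A,22
2,Passenger B,38
3,Passenger C,
4,Passenger D,35
"""

sample = pd.read_csv(io.StringIO(csv_text))  # empty fields become NaN
first_two = sample.head(2)                   # head() without an argument returns 5 rows

print(first_two)
```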

Data types

The data type of each column can be displayed as follows:

titanic.dtypes
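As a rough illustration of how the types are inferred, the sketch below reads a small made-up CSV (hypothetical values) and prints its dtypes. With the default settings, an integer-looking column that contains a missing value is read as float64, because the missing value is stored as NaN.

```python
import io

import pandas as pd

# Hypothetical inline data; the second row has no Age.
csv_text = "Name,Age,Fare,Pclass\nPassenger A,22,7.25,3\nPassenger B,,71.2833,1\n"

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)  # one dtype per column
```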

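Filtering by condition

Rows can be filtered with boolean conditions, and rows and columns can be selected by label (loc) or by position (iloc):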

above_35 = titanic[titanic["Age"] > 35]
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
age_no_na = titanic[titanic["Age"].notna()]
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
titanic.iloc[9:25, 2:5]
titanic.iloc[0:3, 3] = "anonymous"
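The same operations can be followed on a table small enough to check by eye. This is a minimal sketch with hypothetical passengers; the variable names are illustrative only.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Age": [22.0, 38.0, None, 54.0],  # None is stored as NaN
        "Pclass": [3, 1, 2, 1],
    }
)

over_35 = people[people["Age"] > 35]               # NaN > 35 is False, so C is dropped
class_23 = people[people["Pclass"].isin([2, 3])]   # Pclass equal to 2 or 3
known_age = people[people["Age"].notna()]          # rows with a non-missing Age
names_over_35 = people.loc[people["Age"] > 35, "Name"]  # row condition, column label
block = people.iloc[1:3, 0:2]                      # rows 1-2 and columns 0-1, by position

# Assigning through iloc overwrites values in place; work on a copy here.
masked = people.copy()
masked.iloc[0:2, 0] = "anonymous"

print(over_35)
print(masked)
```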

Plotting

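# Plotting requires matplotlib to be installed.
# air_quality is assumed here to be a wide-format DataFrame with one numeric
# column per station (station_antwerp, station_paris, station_london).
air_quality.plot()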
air_quality["station_paris"].plot()
air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
air_quality.plot.box()
air_quality.plot.area(figsize=(12, 4), subplots=True)

Creating new columns

air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882

air_quality["ratio_paris_antwerp"] = (
    air_quality["station_paris"] / air_quality["station_antwerp"]
)

air_quality_renamed = air_quality.rename(
    columns={
        "station_antwerp": "BETR801",
        "station_paris": "FR04014",
        "station_london": "London Westminster",
    }
)
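The column arithmetic above works element by element, row by row. Here is a minimal sketch on a two-row table with made-up measurements, which also shows that rename() returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical measurements, two rows per station column.
air = pd.DataFrame(
    {
        "station_antwerp": [10.0, 40.0],
        "station_paris": [20.0, 30.0],
        "station_london": [10.0, 20.0],
    }
)

# Scalar multiplication is applied to every row of the column.
air["london_mg_per_cubic"] = air["station_london"] * 1.882

# Dividing two columns pairs the values row by row.
air["ratio_paris_antwerp"] = air["station_paris"] / air["station_antwerp"]

# rename() returns a new DataFrame by default.
renamed = air.rename(columns={"station_paris": "FR04014"})

print(air)
print(list(renamed.columns))
```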

Summary statistics

titanic["Age"].mean()

titanic[["Age", "Fare"]].median()

titanic[["Age", "Fare"]].describe()

titanic.agg(
    {
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    }
)
titanic[["Sex", "Age"]].groupby("Sex").mean()

titanic.groupby("Sex").mean(numeric_only=True)

titanic.groupby("Sex")["Age"].mean()

titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

titanic["Pclass"].value_counts()

titanic.groupby("Pclass")["Pclass"].count()
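The grouped statistics above follow a split, apply, combine pattern: rows are split by the grouping column, a statistic is computed per group, and the results are combined into one table. A minimal sketch with four hypothetical passengers:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Pclass": [3, 1, 3, 3],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# Mean age per group of Sex.
mean_age_by_sex = people.groupby("Sex")["Age"].mean()

# Two ways to count rows per class.
class_counts = people["Pclass"].value_counts()
counts_via_groupby = people.groupby("Pclass")["Pclass"].count()

# Different statistics per column; cells for statistics that were
# not requested for a column are filled with NaN.
stats = people.agg({"Age": ["min", "max"], "Fare": ["min", "mean"]})

print(mean_age_by_sex)
print(class_counts)
print(stats)
```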

Reshaping the table layout

titanic.sort_values(by="Age").head()

titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
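A small sketch of the two sorts above, on hypothetical data. Sorting returns a new DataFrame whose rows keep their original index labels, and with a list of columns the rows are ordered by the first column and ties are broken by the next.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Pclass": [3, 1, 3, 1],
        "Age": [22.0, 38.0, 26.0, 35.0],
    }
)

by_age = people.sort_values(by="Age")  # ascending by default
by_class_age = people.sort_values(by=["Pclass", "Age"], ascending=False)

print(by_age)
print(by_class_age)
```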


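# From here on, air_quality is assumed to be a long-format table indexed by
# "date.utc", with "location", "parameter" and "value" columns.
no2 = air_quality[air_quality["parameter"] == "no2"]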
no2_subset = no2.sort_index().groupby(["location"]).head(2)
no2_subset.pivot(columns="location", values="value")
no2.pivot(columns="location", values="value").plot()

air_quality.pivot_table(
    values="value", index="location", columns="parameter", aggfunc="mean"
)

air_quality.pivot_table(
    values="value",
    index="location",
    columns="parameter",
    aggfunc="mean",
    margins=True,
)
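Unlike pivot(), pivot_table() aggregates when several rows fall into the same cell. The sketch below uses a hypothetical long-format table in which Paris has two NO2 readings, so the corresponding cell holds their mean. With margins=True an extra "All" row and column are added.

```python
import pandas as pd

# Hypothetical long-format measurements.
long_table = pd.DataFrame(
    {
        "location": ["Paris", "Paris", "London", "London", "Paris"],
        "parameter": ["no2", "pm25", "no2", "pm25", "no2"],
        "value": [20.0, 10.0, 30.0, 12.0, 40.0],
    }
)

table = long_table.pivot_table(
    values="value", index="location", columns="parameter", aggfunc="mean"
)

with_totals = long_table.pivot_table(
    values="value",
    index="location",
    columns="parameter",
    aggfunc="mean",
    margins=True,  # adds an "All" row and an "All" column
)

print(table)
print(with_totals)
```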



no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
no_2 = no2_pivoted.melt(id_vars="date.utc")
no_2 = no2_pivoted.melt(
    id_vars="date.utc",
    value_vars=["BETR801", "FR04014", "London Westminster"],
    value_name="NO_2",
    var_name="id_location",
)
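pivot() and melt() are roughly inverse operations: the first spreads a long table into one column per location, the second collects those columns back into rows. A minimal round-trip sketch, assuming a hypothetical table indexed by "date.utc" (the timestamps and values are made up):

```python
import pandas as pd

# Hypothetical long-format NO2 readings, indexed by timestamp.
long_table = pd.DataFrame(
    {
        "location": ["BETR801", "FR04014", "BETR801", "FR04014"],
        "value": [26.5, 21.8, 23.0, 27.4],
    },
    index=pd.Index(
        ["2019-05-07 01:00", "2019-05-07 01:00", "2019-05-07 02:00", "2019-05-07 02:00"],
        name="date.utc",
    ),
)

# Long -> wide: one column per location, then turn the index into a column.
wide = long_table.pivot(columns="location", values="value").reset_index()

# Wide -> long: every column except "date.utc" is melted into rows.
back = wide.melt(id_vars="date.utc", value_name="NO_2", var_name="id_location")

print(wide)
print(back)
```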

Combining data from multiple tables

air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv",
                              parse_dates=True)
air_quality_no2 = air_quality_no2[["date.utc", "location",
                                   "parameter", "value"]]

air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv",
                               parse_dates=True)
air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                     "parameter", "value"]]

air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)

air_quality = air_quality.sort_values("date.utc")
air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=["PM25", "NO2"])


stations_coord = pd.read_csv("data/air_quality_stations.csv")
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")
air_quality = pd.merge(air_quality, air_quality_parameters,
                       how='left', left_on='parameter', right_on='id')
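The difference between concat() and merge() can be seen on a few rows. In this sketch (all values, including the coordinate and parameter names, are hypothetical), concat() stacks tables that share the same columns, while merge() with how="left" keeps every row of the left table and fills in NaN where the right table has no match.

```python
import pandas as pd

no2 = pd.DataFrame(
    {"location": ["Paris", "London"], "parameter": ["no2", "no2"], "value": [20.0, 30.0]}
)
pm25 = pd.DataFrame({"location": ["Paris"], "parameter": ["pm25"], "value": [10.0]})

# Stack rows; the original index labels are kept, so 0 appears twice.
combined = pd.concat([pm25, no2], axis=0)

# keys= adds an outer index level recording which table each row came from.
keyed = pd.concat([pm25, no2], keys=["PM25", "NO2"])

# Left join on a shared column name; London has no match in stations.
stations = pd.DataFrame({"location": ["Paris"], "lat": [48.86]})
merged = pd.merge(combined, stations, how="left", on="location")

# Left join where the key columns have different names; both are kept.
params = pd.DataFrame({"id": ["no2", "pm25"], "name": ["Nitrogen dioxide", "Particulate matter"]})
merged_params = pd.merge(merged, params, how="left", left_on="parameter", right_on="id")

print(combined)
print(keyed)
print(merged_params)
```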
