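Learning about the Breast Cancer Wisconsin Dataset (scikit-learn / Toy Datasets)

April 4, 2023 / May 13, 2023

In this article I, a newcomer to statistics, record what I learned about the Breast Cancer Wisconsin dataset, one of the toy datasets in scikit-learn.

I used the scikit-learn guide "7.1. Toy datasets" as a reference and also wrote some Python to deepen my understanding.

The code uses the machine learning library scikit-learn.

I have tried to write clearly so that the article can serve as a reference for others.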

Contents

1. scikit-learn toy datasets
2. The Breast Cancer Wisconsin dataset
3. Python programming
4. Distribution of the dataset
- 4.1. Python programming

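scikit-learn toy datasets

The toy datasets bundled with scikit-learn are small sample datasets for trying out machine learning problems. Several are provided: for example, the iris dataset, for classifying iris flowers by species from their measurements, and the boston dataset, for predicting house prices in the Boston area (since removed from scikit-learn, as noted in the list below).

scikit-learn ships with a few small standard datasets, so no files need to be downloaded from external websites.

They can be loaded with the following functions.

load_boston() : Deprecated in version 1.0 and removed in version 1.2.

load_iris() : Load and return the iris dataset (classification).

load_diabetes() : Load and return the diabetes dataset (regression).

load_linnerud() : Load and return the physical exercise Linnerud dataset.

load_digits() : Load and return the digits dataset (classification).

load_wine() : Load and return the wine dataset (classification).

load_breast_cancer() : Load and return the Breast Cancer Wisconsin dataset (classification).

Source: scikit-learn guide, 7.1. Toy datasets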
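
As a quick check of the loaders listed above, the following sketch calls each one that is still available and prints the shape of its data array. The variable names are illustrative, and load_boston is left out because it no longer exists in scikit-learn 1.2 and later.

```python
from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_digits,
    load_iris,
    load_linnerud,
    load_wine,
)

# Loaders for the toy datasets bundled with scikit-learn
loaders = [load_iris, load_diabetes, load_linnerud,
           load_digits, load_wine, load_breast_cancer]

# Map each loader name to the (n_samples, n_features) shape of its data
shapes = {loader.__name__: loader().data.shape for loader in loaders}

for name, shape in shapes.items():
    print(f"{name}: {shape[0]} samples, {shape[1]} features")
```

Every call reads from files installed with scikit-learn, so the sketch runs without network access.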


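The Breast Cancer Wisconsin dataset

The Breast Cancer Wisconsin (Diagnostic) dataset was created by W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Its features are computed from digitized images of fine needle aspirates (FNA) of breast masses and describe the cell nuclei visible in each image. Every sample belongs to one of two classes (benign or malignant) and has 30 features.

The dataset contains 569 samples, each with the following 30 features.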

mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness
mean concavity
mean concave points
mean symmetry
mean fractal dimension
radius error (standard error of the radius)
texture error (standard error of the texture)
perimeter error (standard error of the perimeter)
area error (standard error of the area)
smoothness error (standard error of the smoothness)
compactness error (standard error of the compactness)
concavity error (standard error of the concavity)
concave points error (standard error of the concave points)
symmetry error (standard error of the symmetry)
fractal dimension error (standard error of the fractal dimension)
worst radius
worst texture
worst perimeter
worst area
worst smoothness
worst compactness
worst concavity
worst concave points
worst symmetry
worst fractal dimension
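
The 30 names fall into three groups of ten: means, standard errors, and "worst" values. A minimal sketch that splits the names by their prefix or suffix (the list names are illustrative):

```python
from sklearn.datasets import load_breast_cancer

feature_names = list(load_breast_cancer().feature_names)

# Group the names by the statistic they describe
mean_features = [n for n in feature_names if n.startswith("mean ")]
error_features = [n for n in feature_names if n.endswith(" error")]
worst_features = [n for n in feature_names if n.startswith("worst ")]

for label, group in [("mean", mean_features),
                     ("standard error", error_features),
                     ("worst", worst_features)]:
    print(f"{label}: {len(group)} features")
    print("  " + ", ".join(group))
```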

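The goal of this dataset is to predict whether a breast tumor is benign or malignant. In scikit-learn's copy, the class label (target) is 0 for malignant and 1 for benign. The dataset has no missing values, and all features are numeric.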
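
These properties can be checked directly on the loaded data. The sketch below counts the samples per class label as stored by scikit-learn and tests for missing values; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Count how many samples carry each class label (0 and 1)
class_counts = np.bincount(cancer.target)
for label, count in enumerate(class_counts):
    print(f"label {label} ({cancer.target_names[label]}): {count} samples")

# Check for missing values and confirm the data type
has_missing = bool(np.isnan(cancer.data).any())
print("contains missing values:", has_missing)
print("dtype of the features:", cancer.data.dtype)
```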

Python programming

To make the Breast Cancer Wisconsin dataset easier to picture, I also worked through it in Python.

Program

The program loads the Breast Cancer Wisconsin dataset and prints its details.

from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset
cancer = load_breast_cancer()

# Print the details of the dataset
print("Breast Cancer Wisconsin dataset:")
print("----------------------------")
print("Description:", cancer['DESCR'])
print("----------------------------")
print("Feature names:", cancer['feature_names'])
print("----------------------------")
print("Number of features:", len(cancer['feature_names']))
print("----------------------------")
print("Class names:", cancer['target_names'])
print("----------------------------")
print("Number of classes:", len(cancer['target_names']))
print("----------------------------")
print("Number of samples:", len(cancer['data']))
print("----------------------------")
print("First 5 rows of data:")
print(cancer['data'][:5])
print("----------------------------")
print("First 5 class labels:")
print(cancer['target'][:5])

Output

Breast Cancer Wisconsin dataset:
----------------------------
Description: .. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radius, field
        10 is Radius SE, field 20 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.
----------------------------
Feature names: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
----------------------------
Number of features: 30
----------------------------
Class names: ['malignant' 'benign']
----------------------------
Number of classes: 2
----------------------------
Number of samples: 569
----------------------------
First 5 rows of data:
[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
  6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
  2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
  3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414e-01
  1.052e-01 2.597e-01 9.744e-02 4.956e-01 1.156e+00 3.445e+00 2.723e+01
  9.110e-03 7.458e-02 5.661e-02 1.867e-02 5.963e-02 9.208e-03 1.491e+01
  2.650e+01 9.887e+01 5.677e+02 2.098e-01 8.663e-01 6.869e-01 2.575e-01
  6.638e-01 1.730e-01]
 [2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01 1.980e-01
  1.043e-01 1.809e-01 5.883e-02 7.572e-01 7.813e-01 5.438e+00 9.444e+01
  1.149e-02 2.461e-02 5.688e-02 1.885e-02 1.756e-02 5.115e-03 2.254e+01
  1.667e+01 1.522e+02 1.575e+03 1.374e-01 2.050e-01 4.000e-01 1.625e-01
  2.364e-01 7.678e-02]]
----------------------------
First 5 class labels:
[0 0 0 0 0]

The description above, point by point (the References section is omitted):

- Number of instances: 569.
- Number of attributes: 30 numeric, predictive attributes, plus the class.
- Ten measurements are taken for each cell nucleus:
    - radius: mean of the distances from the center to points on the perimeter
    - texture: standard deviation of the gray-scale values
    - perimeter
    - area
    - smoothness: local variation in radius lengths
    - compactness: perimeter^2 / area - 1.0
    - concavity: severity of the concave portions of the contour
    - concave points: number of concave portions of the contour
    - symmetry
    - fractal dimension: "coastline approximation" - 1
- For each image, the mean, the standard error, and the "worst" value (the mean of the three largest values) of these ten measurements are computed, which gives 30 features. For instance, field 0 is the mean radius, field 10 is the radius standard error, and field 20 is the worst radius.
- Class: WDBC-Malignant or WDBC-Benign.
- Summary statistics: the table gives the minimum and maximum of each of the 30 features. The scales differ widely, from 0.0 to 0.427 for mean concavity up to 185.2 to 4254.0 for worst area.
- Missing attribute values: none.
- Class distribution: 212 malignant, 357 benign.
- Creators: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian.
- Donor: Nick Street.
- Date: November 1995.

The data is a copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset (https://goo.gl/U2Uwz2).

The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and they describe characteristics of the cell nuclei present in the image.

The separating plane mentioned in the description was obtained with the Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method that uses linear programming to construct a decision tree. Relevant features were selected by an exhaustive search over 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is described in [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

The database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

Explanation of the program

from sklearn.datasets import load_breast_cancer

This imports the load_breast_cancer function from scikit-learn's datasets module, which makes it possible to load the breast cancer dataset.

# Load the breast cancer dataset
cancer = load_breast_cancer()

This calls load_breast_cancer() to load the breast cancer dataset and stores it in a variable named cancer.
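
The object returned by load_breast_cancer() is a scikit-learn Bunch, a dictionary-like container whose entries can also be read as attributes. A minimal sketch showing that both access styles return the same objects, and that return_X_y=True returns only the data and target arrays (X and y are illustrative names):

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Key access and attribute access refer to the same array
same_data = cancer['data'] is cancer.data
same_target = cancer['target'] is cancer.target
print("same data object:", same_data)
print("same target object:", same_target)

# return_X_y=True skips the container and returns (data, target)
X, y = load_breast_cancer(return_X_y=True)
print("X shape:", X.shape)
print("y shape:", y.shape)
```

This is why cancer['data'] and cancer.data can be used interchangeably in the programs in this article.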

# Print the details of the dataset
print("Breast Cancer Wisconsin dataset:")
print("----------------------------")
print("Description:", cancer['DESCR'])
print("----------------------------")
print("Feature names:", cancer['feature_names'])
print("----------------------------")
print("Number of features:", len(cancer['feature_names']))
print("----------------------------")
print("Class names:", cancer['target_names'])
print("----------------------------")
print("Number of classes:", len(cancer['target_names']))
print("----------------------------")
print("Number of samples:", len(cancer['data']))
print("----------------------------")

print("Description:", cancer['DESCR']) prints the full description of the Breast Cancer Wisconsin dataset.
print("Feature names:", cancer['feature_names']) prints the names of the features.
print("Number of features:", len(cancer['feature_names'])) prints the number of features.
print("Class names:", cancer['target_names']) prints the names of the classes.
print("Number of classes:", len(cancer['target_names'])) prints the number of classes.
print("Number of samples:", len(cancer['data'])) prints the number of samples in the dataset.

print("First 5 rows of data:")
print(cancer['data'][:5])
print("----------------------------")
print("First 5 class labels:")
print(cancer['target'][:5])

print(cancer['data'][:5]) prints the first 5 rows of the Breast Cancer Wisconsin dataset.

cancer['data'] is the data matrix of the dataset. Each row is one data point (sample), and the 30 columns hold the values of the 30 features. cancer['data'][:5] is a Python slice that takes the first 5 rows, so print(cancer['data'][:5]) displays the feature values of the first 5 data points.

print(cancer['target'][:5]) prints the first 5 class labels of the Breast Cancer Wisconsin dataset.

cancer['target'] holds the class labels, one for each data point. cancer['target'][:5] is a Python slice that takes the first 5 labels, so print(cancer['target'][:5]) displays the class labels of the first 5 data points. The output [0 0 0 0 0] means that the first 5 samples are all malignant.
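
Because the class labels are integers, it can help to print them next to the class names. A small sketch (variable names are illustrative) that uses the labels as indices into target_names:

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

first_labels = cancer['target'][:5]                 # integer labels
first_names = cancer['target_names'][first_labels]  # names looked up by label

for i, (label, name) in enumerate(zip(first_labels, first_names)):
    # Show only the first three feature values to keep the output short
    print(f"sample {i}: label={label} ({name}), features={cancer['data'][i, :3]}")
```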

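Distribution of the dataset

Python programming

To make the distribution of the Breast Cancer Wisconsin dataset easier to picture, I also worked through it in Python.

Program

For each feature of the Breast Cancer Wisconsin dataset, the program plots histograms of the WDBC-Malignant and WDBC-Benign distributions.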

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset
breast_cancer_data = load_breast_cancer()

# Draw a histogram of each class for every feature
for i, feature_name in enumerate(breast_cancer_data.feature_names):
    # Histogram of the malignant samples (target == 0)
    plt.hist(breast_cancer_data.data[breast_cancer_data.target == 0, i],
             bins=50, alpha=0.5, label='Malignant')

    # Histogram of the benign samples (target == 1)
    plt.hist(breast_cancer_data.data[breast_cancer_data.target == 1, i],
             bins=50, alpha=0.5, label='Benign')

    plt.xlabel(feature_name)
    plt.ylabel("Number of samples")
    plt.legend(loc="best")
    plt.show()

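Output

[Figures: one histogram per feature, 30 in total, each overlaying the Malignant and Benign distributions]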
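
As a numeric complement to the histograms, the sketch below compares the per-class mean of each feature. The "gap" it prints (difference of the class means divided by the overall standard deviation) is an illustrative measure chosen here, not something taken from the scikit-learn guide.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]

# Per-class mean of every feature
malignant_mean = malignant.mean(axis=0)
benign_mean = benign.mean(axis=0)

# Gap between the class means, in units of the overall standard deviation
gap = np.abs(malignant_mean - benign_mean) / cancer.data.std(axis=0)

# List the five features with the largest gap first
for idx in np.argsort(gap)[::-1][:5]:
    print(f"{cancer.feature_names[idx]}: malignant mean={malignant_mean[idx]:.3f}, "
          f"benign mean={benign_mean[idx]:.3f}, gap={gap[idx]:.2f}")
```

Features with a large gap are the ones whose two histograms overlap least.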

Explanation of the program

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

This imports Matplotlib's pyplot module under the name plt.

It also imports the load_breast_cancer function from the scikit-learn library.

# Load the breast cancer dataset
breast_cancer_data = load_breast_cancer()

This calls load_breast_cancer() to load the breast cancer dataset and stores it in breast_cancer_data.

# Draw a histogram of each class for every feature
for i, feature_name in enumerate(breast_cancer_data.feature_names):

This starts a loop that goes through the feature names of the breast cancer dataset in order. i is the index of the feature and feature_name is its name.

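    # Histogram of the malignant samples (target == 0)
    plt.hist(breast_cancer_data.data[breast_cancer_data.target == 0, i],
             bins=50, alpha=0.5, label='Malignant')

    # Histogram of the benign samples (target == 1)
    plt.hist(breast_cancer_data.data[breast_cancer_data.target == 1, i],
             bins=50, alpha=0.5, label='Benign')

The first call selects the i-th feature of the samples whose class is 0 (malignant) and draws its histogram. The second call does the same for the samples whose class is 1 (benign). bins=50 divides the value range into 50 bins, alpha sets the transparency so that both overlapping histograms stay visible, and label sets the name shown in the legend.

    plt.xlabel(feature_name)
    plt.ylabel("Number of samples")
    plt.legend(loc="best")
    plt.show()

plt.xlabel(feature_name) sets the x-axis label to feature_name.
plt.ylabel("Number of samples") sets the y-axis label to "Number of samples".
plt.legend(loc="best") places the legend at the position Matplotlib judges best.
plt.show() displays the histogram that was drawn. Because it is called inside the loop, one figure is shown for each feature.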
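
In the program above, each plt.hist call chooses its own 50 bins from the values it receives, so the bars of the two classes do not share the same edges. A minimal sketch, assuming shared bins are wanted, computes one set of edges with NumPy and counts both classes against it (the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
i = 0  # index of the feature to inspect (0 is 'mean radius')
column = cancer.data[:, i]

# One set of 50 bin edges covering both classes
edges = np.histogram_bin_edges(column, bins=50)

# Count the samples of each class that fall into each shared bin
malignant_counts, _ = np.histogram(column[cancer.target == 0], bins=edges)
benign_counts, _ = np.histogram(column[cancer.target == 1], bins=edges)

print("feature:", cancer.feature_names[i])
print("malignant samples counted:", malignant_counts.sum())
print("benign samples counted:", benign_counts.sum())
```

Passing the same edges array as bins=edges to both plt.hist calls would align the two histograms in the plot as well.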


Posted by Yamada
