Working with categorical data in pandas

Source: https://gairuo.com

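Categorical data has a categories property and an ordered property, which list the possible values and state whether their order matters. These properties are exposed as s.cat.categories and s.cat.ordered. If you do not specify the categories and the ordering yourself, they are inferred from the arguments you pass.

Ordering

New categorical data is not automatically ordered. You must explicitly pass ordered=True to indicate an ordered categorical.

Inspect the categories and the ordering of categorical data: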

import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
s.cat.categories
# Index(['a', 'b', 'c'], dtype='object')
s.cat.ordered
# False

You can also pass the categories in a specific order:

s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))
s.cat.categories
# Index(['c', 'b', 'a'], dtype='object')
s.cat.ordered
# False
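Both examples above report ordered as False. The following is a minimal sketch of what changes once ordered=True is passed; the size labels and their order (small < medium < large) are made up for illustration and do not come from the original tutorial.

```python
import pandas as pd

# Hypothetical data: the labels and their order are assumptions for this sketch
sizes = pd.Series(
    pd.Categorical(
        ["medium", "small", "large", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

# The categorical now reports itself as ordered
print(sizes.cat.ordered)

# min/max, comparisons and sorting follow the category order, not alphabetical order
print(sizes.min(), sizes.max())
print((sizes > "small").tolist())
print(sizes.sort_values().tolist())
```

Without ordered=True, min(), max() and comparisons such as > are not defined for a categorical, which is why the flag has to be set explicitly.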

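The result of unique() is not always the same as Series.cat.categories, because Series.unique() makes two guarantees: it returns the values in order of appearance, and it only includes values that are actually present.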

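from pandas.api.types import CategoricalDtype

s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
s
'''
0    b
1    a
2    b
3    c
dtype: category
Categories (4, object): [a, b, c, d]
'''

# categories
s.cat.categories
# Index(['a', 'b', 'c', 'd'], dtype='object')

# unique
s.unique()
'''
[b, a, c]
Categories (3, object): [b, a, c]
'''
# Note: the output above is from an older pandas release. Since pandas 1.3,
# unique() keeps the original dtype, so the Categories line lists all four categories.

Descriptive statistics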

Calling describe() on categorical data produces output similar to that of a Series or DataFrame of string type.

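import numpy as np

# cat is not defined in the original; this definition is an assumption
# chosen to be consistent with the output shown below
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
df.describe()
'''
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
'''

df["cat"].describe()
'''
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
'''

Renaming categories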

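Categories are renamed with the rename_categories() method. Older pandas releases also allowed assigning new values directly to the Series.cat.categories attribute; that setter was deprecated and is no longer available in pandas 2.0 and later, so the examples here use rename_categories() throughout:

s = pd.Series(["a", "b", "c", "a"], dtype="category")
s
'''
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
'''

s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])
s
'''
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
'''

s = s.cat.rename_categories([1, 2, 3])
s
'''
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]
'''

# Rename using a dict
s = s.cat.rename_categories({1: 'x', 2: 'y', 3: 'z'})
s
'''
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): [x, y, z]
'''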
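Besides a list or a dict, rename_categories() also accepts a callable that is applied to every existing category. The sketch below is an illustration added here, not part of the original tutorial; the choice of str.upper is arbitrary.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The function is called once per category, not once per value
upper = s.cat.rename_categories(lambda c: c.upper())

print(list(upper.cat.categories))
print(upper.tolist())

# rename_categories returns a new Series; the original is left unchanged
print(list(s.cat.categories))
```

A callable is convenient when the new names can be derived mechanically from the old ones, because it avoids spelling out the full mapping.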

Note that the new categories must be unique, otherwise a ValueError is raised:

try:
    s = s.cat.rename_categories([1, 1, 1])
except ValueError as e:
    print("ValueError:", str(e))
# ValueError: Categorical categories must be unique

A NaN among the new categories also raises a ValueError:

try:
    s = s.cat.rename_categories([1, 2, np.nan])
except ValueError as e:
    print("ValueError:", str(e))
# ValueError: Categorical categories cannot be null

Appending new categories

New categories can be appended with the add_categories() method:

s = s.cat.add_categories([4])
s.cat.categories
# Index(['x', 'y', 'z', 4], dtype='object')
s
'''
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): [x, y, z, 4]
'''

Removing categories

Categories can be removed with the remove_categories() method. Values that belonged to a removed category are replaced with np.nan.

s = s.cat.remove_categories([4])
s
'''
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): [x, y, z]
'''
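The example above removes a category that no value uses, so nothing becomes NaN. As a small sketch of the other case (built on a fresh Series so that it runs on its own), removing a category that is in use turns the affected values into missing values:

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# "x" is in use, so both of its occurrences become NaN
trimmed = s.cat.remove_categories(["x"])

print(list(trimmed.cat.categories))
print(trimmed.isna().tolist())
```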

Removing unused categories:

s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
s
'''
0    a
1    b
2    a
dtype: category
Categories (4, object): [a, b, c, d]
'''

s.cat.remove_unused_categories()
'''
0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]
'''

Setting categories

If you want to remove and add new categories in a single step (which has a speed advantage), or simply set the categories to a predefined set, use set_categories().

s = pd.Series(["one", "two", "four", "-"], dtype="category")
s
'''
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): [-, four, one, two]
'''

s = s.cat.set_categories(["one", "two", "three", "four"])
s
'''
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): [one, two, three, four]
'''
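To make the "single step" point concrete, here is a rough comparison, added for illustration, between set_categories() and the equivalent remove_categories() followed by add_categories(). The values end up the same, but the order of the resulting categories differs.

```python
import pandas as pd

s = pd.Series(["one", "two", "four", "-"], dtype="category")

# One step: state the final set of categories directly
one_step = s.cat.set_categories(["one", "two", "three", "four"])

# Two steps: drop "-" first, then append "three"
two_steps = s.cat.remove_categories(["-"]).cat.add_categories(["three"])

# Both versions turn the "-" entry into NaN
print(one_step.isna().tolist())
print(two_steps.isna().tolist())

# set_categories keeps the order that was passed in,
# while add_categories appends to the existing (sorted) categories
print(list(one_step.cat.categories))
print(list(two_steps.cat.categories))
```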

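Be aware that Categorical.set_categories() cannot know whether a category was omitted on purpose, because it was misspelled, or (under Python 3) because of a type difference (for example, NumPy S1 dtype versus Python strings). This can lead to surprising behavior!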
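One way such a surprise can show up, sketched here with made-up integer data rather than the NumPy S1 case mentioned above: if the existing categories are integers and the new ones are passed as strings, nothing matches and every value silently becomes NaN.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# The new categories are strings, the existing ones are integers: no match at all
mismatched = s.cat.set_categories(["1", "2", "3"])
print(mismatched.isna().tolist())

# With matching types the values are preserved
matched = s.cat.set_categories([1, 2, 3, 4])
print(matched.isna().tolist())
```

No error or warning is raised in the mismatched case, so it can be worth checking isna() after calling set_categories() on data whose types you do not control.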


Last updated: 2023-05-12 06:56:31. Tags: pandas, categorical data