pandas

2016-08-24 11:15:01






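Pandas is an open-source Python library for data analysis. It is built on NumPy and handles structured data such as tables and time series. It provides two main data structures: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table whose columns can hold different types. Pandas offers methods for filtering, sorting, merging, and grouping data, and supports statistical analysis and plotting. It integrates with other Python libraries such as Matplotlib and Seaborn, and is widely used in finance, healthcare, and scientific research.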


Outline

DataFrame

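Construction

From a dict of Series

import numpy as np
import pandas as pd
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
The index and columns arguments select and order the rows and columns; a column that is missing from d (here 'three') is filled with NaN.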
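
The alignment behaviour described above can be checked with a small sketch. It reuses the dict d from this node; the variable name sub is only illustrative.

```python
import pandas as pd

# Two Series with different indexes: 'one' has no value for label 'd'.
d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(d)

# The row index is the union of both Series indexes,
# and the entry that 'one' lacks is filled with NaN.
print(df)

# Select and reorder rows and columns at construction time.
# 'three' is not a key of d, so that column contains only NaN.
sub = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(sub)
```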

From a dict of arrays or lists

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])


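From a structured array

# i4 is a 4-byte integer, f4 a 4-byte float, S10 a 10-byte string (written 'a10' in older NumPy)
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
pd.DataFrame(data)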

From a dict of tuples

pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

Output:
       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0


Basic attributes

DataFrame.index: the row labels

DataFrame.columns: the column labels

DataFrame.values: the underlying values as a NumPy array
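
A minimal sketch of the three attributes, using a small made-up frame (the data is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.]}, index=['a', 'b'])

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.values)            # 2-D NumPy array holding the data
print(df.values.shape)      # (rows, columns)
```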

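Basic functions

DataFrame.describe() computes simple summary statistics

DataFrame.T transposes the frame

DataFrame.sort_values(by='X') sorts by column X, ascending by default (older pandas used DataFrame.sort(), which has since been removed)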
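
The three calls above could be exercised as follows. The frame and the variable names are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2], 'B': [10., 30., 20.]},
                  index=['x', 'y', 'z'])

stats = df.describe()        # count, mean, std, min, quartiles, max per column
transposed = df.T            # rows become columns and vice versa

by_a = df.sort_values(by='A')                        # ascending (default)
by_a_desc = df.sort_values(by='A', ascending=False)  # descending

print(stats)
print(by_a)
```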

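Slicing

DataFrame['X'], where X is a column label

1. DataFrame[0:n] selects the first n rows (positions 0 through n-1)
2. DataFrame['label1':'label2'] selects all rows between the two row labels, endpoints included

DataFrame.loc[a, b], where a selects row labels and b selects column labels. Example: DataFrame.loc['1':'5', ['A', 'B']]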
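
The difference between positional and label slicing is easy to get wrong, so here is a short sketch on an illustrative frame (data and names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]},
                  index=['a', 'b', 'c', 'd'])

col = df['A']                          # one column, returned as a Series
first_two = df[0:2]                    # positional slice: end is excluded
labelled = df['b':'d']                 # label slice: both endpoints included
block = df.loc['a':'c', ['A', 'B']]    # rows by label range, columns by list

print(first_two)
print(labelled)
print(block)
```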

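Boolean selection

Numeric

DataFrame[expression], where the expression is a boolean condition that may apply to the whole DataFrame. Examples: DataFrame[DataFrame > 1] keeps every value greater than 1 and replaces the rest with NaN; DataFrame[DataFrame.A > 1] returns the rows where column A is greater than 1.

Non-numeric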

DataFrame[DataFrame['E'].isin(['cuiwei','CW'])]
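
A small sketch of the three selection forms listed under this node. The column E and its values follow the isin example; the numeric data is made up.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 2, 3],
                   'B': [5, 1, 0],
                   'E': ['cuiwei', 'x', 'CW']})

num = df[['A', 'B']]
masked = num[num > 1]      # same shape; values that fail the test become NaN
rows = df[df.A > 1]        # keeps only rows where column A exceeds 1
named = df[df['E'].isin(['cuiwei', 'CW'])]  # rows whose E is in the list

print(masked)
print(rows)
print(named)
```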

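Missing values

DataFrame.dropna(how='any') drops every row that contains at least one missing value

DataFrame.fillna(value) fills missing entries with the given value
DataFrame.fillna(method='ffill') or method='bfill' fills with the previous or the next valid value (recent pandas deprecates the method argument in favour of DataFrame.ffill() and DataFrame.bfill())
DataFrame.fillna({column: value, column: value}) fills each column with its own value

numpy.nan is the marker used for a missing value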
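
The options above, side by side on a tiny frame. This is a sketch with made-up data; it uses DataFrame.ffill() rather than the method argument so that it also runs on recent pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, 6.0]})

dropped = df.dropna(how='any')            # only fully populated rows survive
filled = df.fillna(0)                     # one value for every gap
forward = df.ffill()                      # carry the previous valid value down
per_col = df.fillna({'A': -1, 'B': -2})   # a different fill value per column

print(dropped)
print(forward)
print(per_col)
```

Note that a forward fill cannot fill a gap in the first row, because there is no earlier value to carry forward.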

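Applying a function to a DataFrame (column by column or row by row, controlled by axis)

Requirement: the function must accept an array-like argument (pandas passes each column or row as a Series)

Form: DataFrame.apply(func)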
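
To illustrate the axis argument, here is a sketch with a hypothetical helper, value_range, that is not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def value_range(s):
    """Hypothetical helper: spread between the largest and smallest value."""
    return s.max() - s.min()

per_column = df.apply(value_range)          # axis=0 (default): one call per column
per_row = df.apply(value_range, axis=1)     # axis=1: one call per row

print(per_column)
print(per_row)
```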

Combining

concat (stacks frames along the rows, aligned by column)

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
frames = [df1, df2, df3]
result = pd.concat(frames)
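
pd.concat also accepts an axis argument, which the example above leaves at its default. A reduced sketch with two small frames (shortened from the ones above) shows both directions:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

rows = pd.concat([df1, df2])           # axis=0: frames stacked on top of each other
side = pd.concat([df1, df2], axis=1)   # axis=1: frames placed side by side

print(rows)
# The indexes do not overlap, so each half of `side` is NaN where
# the other frame has no row with that label.
print(side)
```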

merge (similar to a SQL JOIN)

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key')
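
In the example above every key appears in both frames, so the join type does not matter. The following sketch uses keys that only partly overlap (an assumption made for illustration) to show how the how argument maps to SQL join types:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K2', 'K3'], 'C': ['C0', 'C2', 'C3']})

inner = pd.merge(left, right, on='key')                  # default: keys in both
left_join = pd.merge(left, right, on='key', how='left')  # all keys from left
outer = pd.merge(left, right, on='key', how='outer')     # keys from either side

print(inner)
print(left_join)   # K1 has no match on the right, so its C is NaN
print(outer)
```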

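append (appends rows to a DataFrame; DataFrame.append was removed in pandas 2.0, where pd.concat does the same job)

result = df1.append(df2)         # pandas before 2.0
result = pd.concat([df1, df2])   # pandas 2.0 and later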

Grouping

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
grouped = df.groupby('A')

df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
df2.groupby(['X']).sum()

Output:
   Y
X
A  7
B  3
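
A grouped object does nothing until an aggregation is applied. This sketch uses fixed numbers instead of random ones so the results are reproducible; the data is made up.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'two'],
                   'C': [1, 2, 3, 4]})

sums = df.groupby('A')['C'].sum()               # one group per value of A
multi = df.groupby(['A', 'B'])['C'].sum()       # groups keyed by (A, B) pairs
agg = df.groupby('A')['C'].agg(['min', 'max'])  # several statistics at once

print(sums)
print(multi)
print(agg)
```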

Reshape (I personally prefer to call it categorizing)

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

Output:
                     A         B
first second
bar   one     0.721555 -0.706771
      two    -1.039575  0.271860
baz   one    -0.424972  0.567020
      two     0.276232 -1.087401

df.loc['bar'].loc['one'] returns the row 0.721555 -0.706771
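
The chained lookup above can also be written as a single tuple lookup, and the inner index level can be moved into the columns with unstack. A sketch with fixed values (an assumption, so that results are reproducible):

```python
import pandas as pd

tuples = [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame({'A': [1., 2., 3., 4.], 'B': [5., 6., 7., 8.]}, index=index)

bar = df.loc['bar']             # all rows under first == 'bar'
row = df.loc[('bar', 'one')]    # one row, selected by the full key
chained = df.loc['bar'].loc['one']  # same values as `row`

# Move the 'second' level from the row index into the columns.
unstacked = df.unstack()

print(row)
print(unstacked)
```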

Pivot tables

df

Output:
        A  B    C         D         E          F
0     one  A  foo  0.341734 -0.317441 2013-01-01
1     one  B  foo  0.959726 -1.236269 2013-02-01
2     two  C  foo -1.110336  0.896171 2013-03-01
3   three  A  bar -0.619976 -0.487602 2013-04-01
4     one  B  bar  0.149748 -0.082240 2013-05-01
5     one  C  bar -0.732339 -2.182937 2013-06-01
6     two  A  foo  0.687738  0.380396 2013-07-01
..    ... ..  ...       ...       ...        ...
17    one  C  bar -0.345352  0.206053 2013-06-15
18    two  A  foo  1.314232 -0.251905 2013-07-15
19  three  B  foo  0.690579 -2.213588 2013-08-15
20    one  C  foo  0.995761  1.063327 2013-09-15
21    one  A  bar  2.396780  1.266143 2013-10-15
22    two  B  bar  0.014871  0.299368 2013-11-15
23  three  C  bar  3.357427 -0.863838 2013-12-15

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Output:
C             bar       foo
A     B
one   A  1.120915 -0.514058
      B -0.338421  0.002759
      C -0.538846  0.699535
three A -1.181568       NaN
      B       NaN  0.433512
      C  0.588783       NaN
two   A       NaN  1.000985
      B  0.158248       NaN
      C       NaN  0.176180
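
Each cell of the pivot table above is an aggregate of D over the rows that share the same A, B, and C; by default the aggregate is the mean. A reduced sketch with made-up numbers makes this visible:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one'],
                   'B': ['x', 'y', 'x', 'y', 'x'],
                   'C': ['foo', 'bar', 'foo', 'foo', 'foo'],
                   'D': [1., 2., 3., 4., 5.]})

# Default aggregation is the mean of D within each (A, B, C) combination.
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

# Another aggregation can be requested through aggfunc.
totals = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'],
                        aggfunc='sum')

print(table)    # combinations that never occur are NaN
print(totals)
```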

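Time series

Representing time

rng = pd.date_range('1/1/2011', periods=72, freq='H')  # hourly; pandas 2.2 and later spell the alias 'h'
ts = pd.Series(np.random.randn(len(rng)), index=rng)

Aggregating by time

ts.resample('1D').sum() aggregates the series into daily data (older pandas wrote this as ts.resample('1D', how='sum')). The same pattern works with M for month, H for hour, and Min for minute; pandas 2.2 and later prefer 'ME' and 'h' for the first two.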
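
A reproducible sketch of daily aggregation. It uses an offset object, pd.offsets.Hour(), for the hourly frequency (a choice made here to sidestep the alias spelling differences between pandas versions) and sequential values instead of random ones.

```python
import numpy as np
import pandas as pd

# 72 hourly timestamps covering three full days.
rng = pd.date_range('2011-01-01', periods=72, freq=pd.offsets.Hour())
ts = pd.Series(np.arange(72, dtype=float), index=rng)

daily_sum = ts.resample('1D').sum()   # one value per calendar day
daily_max = ts.resample('1D').max()   # any aggregation can follow resample

print(daily_sum)
print(daily_max)
```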

Further reading, a fairly complete list of functions: http://www.cnblogs.com/prpl/p/5537417.html

Series

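Form

x = pd.Series(data, index=index). When data and index do not line up, missing labels are filled with NaN (dict data) or a scalar is repeated for every label.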
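
The two cases mentioned above, NaN filling and scalar repetition, can be seen in a short sketch (values and labels are made up):

```python
import numpy as np
import pandas as pd

# dict data: values are looked up by label; 'z' is not a key, so it becomes NaN.
s_dict = pd.Series({'a': 1., 'b': 2.}, index=['b', 'a', 'z'])

# scalar data: the value is repeated once per index label.
s_scalar = pd.Series(5., index=['x', 'y', 'z'])

# array data: the index must have the same length as the array.
s_array = pd.Series(np.array([1, 2, 3]), index=['a', 'b', 'c'])

print(s_dict)
print(s_scalar)
print(s_array)
```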

Types accepted for data

Python dict