Creating a large Pandas DataFrame: preallocation vs append vs concat (how to preprocess data with pandas)

25-03-20 8

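This article covers creating a large Pandas DataFrame (preallocation vs append vs concat) and how to preprocess data with pandas. It also collects related answers on comparing two DataFrames using a substring of one DataFrame column, iterating and assigning in a Pandas DataFrame, converting between pandas and Spark DataFrames (how should the data types be converted?), and filtering the rows of a Pandas DataFrame by a column of another DataFrame.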

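Contents:

Creating a large Pandas DataFrame: preallocation vs append vs concat (how to preprocess data with pandas)
Pandas - comparing two DataFrames using a substring of one DataFrame column
Pandas DataFrame - iterating and assigning
Converting between pandas DataFrame and Spark DataFrame (how should the data types be converted?)
Filtering the rows of a Pandas DataFrame using a column of another DataFrame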

Creating a large Pandas DataFrame: preallocation vs append vs concat (how to preprocess data with pandas)

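Pandas performance puzzles me when I build a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is because Numpy grabs all the memory it needs at once instead of having to reallocate memory on every append operation.

In Pandas, I seem to get better performance by using the df = df.append(temp) pattern.

Here is an example with timing. The Timer class is defined below. As you can see, I find that preallocating is roughly 10x slower than using append! Preallocating the dataframe with np.empty values of the appropriate dtype helps a great deal, but the append method is still the fastest.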

import numpy as np
from numpy.random import rand
import pandas as pd

from timer import Timer

# Some constants
num_dfs = 10  # Number of random dataframes to generate
n_rows = 2500
n_cols = 40
n_reps = 100  # Number of repetitions for timing

# Generate a list of num_dfs dataframes of random values
df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)), columns=np.arange(n_cols)) for i in np.arange(num_dfs)]

##
# Define two methods of growing a large dataframe
##

# Method 1 - append dataframes
# (DataFrame.append was removed in pandas 2.0, so this runs only on older versions)
def method1():
    out_df1 = pd.DataFrame(columns=np.arange(4))
    for df in df_list:
        out_df1 = out_df1.append(df, ignore_index=True)
    return out_df1

def method2():
    # # Create an empty dataframe that is big enough to hold all the dataframes in df_list
    out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows))
    #EDIT_1: Set the dtypes of each column
    for ix, col in enumerate(out_df2.columns):
        out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])
    # Fill in the values
    for ix, df in enumerate(df_list):
        out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df2

# EDIT_2:
# Method 3 - preallocate dataframe with np.empty data of appropriate type
def method3():
    # Create fake data array
    data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt) for dt in df_list[0].dtypes]))
    # Create placeholder dataframe
    out_df3 = pd.DataFrame(data)
    # Fill in the real values
    for ix, df in enumerate(df_list):
        out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df3

##
# Time both methods
##

# Time Method 1
times_1 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df1 = method1()
    times_1[i] = t.secs
print('Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1)))
print('Best time: %f' % (np.min(times_1)))
print('Mean time: %f' % (np.mean(times_1)))
#>>  Total time for 100 repetitions of Method 1: 2.928296 [sec]
#>>  Best time: 0.028532
#>>  Mean time: 0.029283

# Time Method 2
times_2 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df2 = method2()
    times_2[i] = t.secs
print('Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2)))
print('Best time: %f' % (np.min(times_2)))
print('Mean time: %f' % (np.mean(times_2)))
#>>  Total time for 100 repetitions of Method 2: 32.143247 [sec]
#>>  Best time: 0.315075
#>>  Mean time: 0.321432

# Time Method 3
times_3 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df3 = method3()
    times_3[i] = t.secs
print('Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3)))
print('Best time: %f' % (np.min(times_3)))
print('Mean time: %f' % (np.mean(times_3)))
#>>  Total time for 100 repetitions of Method 3: 6.577038 [sec]
#>>  Best time: 0.063437
#>>  Mean time: 0.065770

Many thanks to Huy Nguyen for the Timer class:

# credit: http://www.huyng.com/posts/python-performance-analysis/
import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        # time.clock() was removed in Python 3.8; time.perf_counter() replaces it
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.end = time.perf_counter()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print('elapsed time: %f ms' % self.msecs)

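If you are still following, I have two questions:

1) Why is the append method faster? (Note: for very small dataframes, i.e. n_rows = 40, it is actually slower.)

2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large csv files.)

Thanks for your help!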

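EDIT_1: In my real-world project, the columns have different dtypes. So I cannot use the pd.DataFrame(.... dtype=some_type) trick to improve the performance of preallocation, as BrenBarn suggested. The dtype parameter forces all columns to the same dtype. See issue 4464 (https://github.com/pydata/pandas/issues/4464).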

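I added a few lines to method2() in the code to change the dtypes column by column so that they match the input dataframes. This operation is expensive and cancels out the benefit of having the appropriate dtypes when writing blocks of rows.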

EDIT_2: Try preallocating the dataframe using a placeholder array np.empty(... dtype=some_type), as @Joris suggested.

Answer 1

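Your benchmark is actually too small to show the real difference. Appending copies every time, so you are actually copying a memory space of size N, N*(N-1) times. This becomes horribly inefficient as the size of your dataframe grows. It certainly might not matter for a very small frame, but at any real size it matters a lot. This is specifically noted in the docs, though only as a rather small warning.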
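To make the copying argument concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified cost model in which every append rebuilds the whole accumulated frame, while a single concat copies each chunk once. The function names are hypothetical and the model ignores per-call overhead, so treat the numbers as an illustration rather than a measurement.

```python
def rows_copied_by_append(num_chunks, rows_per_chunk):
    """Rows copied if every append rebuilds the whole accumulated frame."""
    total = 0
    accumulated = 0
    for _ in range(num_chunks):
        accumulated += rows_per_chunk
        total += accumulated  # the new frame holds everything seen so far
    return total


def rows_copied_by_concat(num_chunks, rows_per_chunk):
    """Rows copied if all chunks are joined in one pass."""
    return num_chunks * rows_per_chunk


# Sizes taken from the question: 10 chunks of 2500 rows each.
append_rows = rows_copied_by_append(10, 2500)
concat_rows = rows_copied_by_concat(10, 2500)
print(append_rows, concat_rows)
```

Under this model the append loop copies 2500 * (1 + 2 + ... + 10) rows, and the gap to a single concat widens as the number of chunks grows.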

In [97]: df = DataFrame(np.random.randn(100000,20))

In [98]: df['B'] = 'foo'

In [99]: df['C'] = pd.Timestamp('20130101')

In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0     100000 non-null float64
1     100000 non-null float64
2     100000 non-null float64
3     100000 non-null float64
4     100000 non-null float64
5     100000 non-null float64
6     100000 non-null float64
7     100000 non-null float64
8     100000 non-null float64
9     100000 non-null float64
10    100000 non-null float64
11    100000 non-null float64
12    100000 non-null float64
13    100000 non-null float64
14    100000 non-null float64
15    100000 non-null float64
16    100000 non-null float64
17    100000 non-null float64
18    100000 non-null float64
19    100000 non-null float64
B     100000 non-null object
C     100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB

Append

In [85]: def f1():
   ....:     result = df
   ....:     for i in range(9):
   ....:         result = result.append(df)
   ....:     return result
   ....:

Concat

In [86]: def f2():
   ....:     result = []
   ....:     for i in range(10):
   ....:         result.append(df)
   ....:     return pd.concat(result)
   ....:

In [100]: f1().equals(f2())
Out[100]: True

In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop

In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop

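Note that I wouldn't even bother trying to preallocate. It is somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you could make a giant frame and simply use .loc, and it would work). But pd.concat is just dead simple, reliable, and fast.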

And the timings for your sizes from above

In [104]: df = DataFrame(np.random.randn(2500,40))

In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop

In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop
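For the second question (chunks that are large csv files), the same pattern can be sketched as follows: read each file into its own frame, collect the frames in a list, and call pd.concat once at the end. The in-memory io.StringIO objects below are hypothetical stand-ins for real files on disk, and the column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-ins for large CSV files; real code would pass file paths.
csv_sources = [
    io.StringIO("a,b\n1,x\n2,y\n"),
    io.StringIO("a,b\n3,z\n"),
    io.StringIO("a,b\n4,w\n5,v\n"),
]

# Read every chunk first, then join them in a single concat call.
chunks = [pd.read_csv(src) for src in csv_sources]
combined = pd.concat(chunks, ignore_index=True)
print(combined.shape)
```

Because each column is parsed per file, mixed dtypes are handled without any manual preallocation.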

Pandas - comparing two DataFrames using a substring of one DataFrame column

I was able to get the desired output using the approach below

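# each key array must line up with its own frame: df1's key goes in left_on, df2's in right_on
df1.merge(df2, left_on=df1.prod_id.str.extract(r'(\d+)', expand=False), right_on=df2.prod_ref.str.extract(r'(\d+)', expand=False), how='left')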
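A self-contained sketch of the same idea is shown here with made-up data. The sample values and the helper column name key are assumptions for illustration only. It extracts the digits from each identifier into an explicit key column and then merges on that column, which is a slightly more verbose alternative to passing arrays to left_on and right_on.

```python
import pandas as pd

# Hypothetical sample data; only the column names prod_id and prod_ref come from the text.
df1 = pd.DataFrame({"prod_id": ["A-101", "A-202", "A-303"], "price": [10, 20, 30]})
df2 = pd.DataFrame({"prod_ref": ["ref202", "ref101"], "stock": [5, 7]})

# Extract the first run of digits from each identifier as the join key.
df1_key = df1["prod_id"].str.extract(r"(\d+)", expand=False)
df2_key = df2["prod_ref"].str.extract(r"(\d+)", expand=False)

# A left merge keeps every row of df1; unmatched rows get NaN in df2's columns.
merged = df1.assign(key=df1_key).merge(df2.assign(key=df2_key), on="key", how="left")
print(merged)
```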

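Pandas DataFrame - iterating and assigning

Here is a solution with nested iterrows; it works but is not efficient. I would love to know whether someone can offer a more efficient vectorized solution: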

for idx,row in final_data.iterrows():
    if row['score ratio'] < 0.05:
        min_distance = math.inf
        target_index = -1
        for idx2,row2 in final_data.iterrows():
            if row2['Teacher'] == row['Teacher'] and\
                    row2['Class'] == row['Class'] and\
                    row2['Percent'] > row['Percent'] and\
                    row2['Percent'] - row['Percent'] < min_distance:
                min_distance = row2['Percent'] - row['Percent']
                target_index = idx2
        final_data.loc[idx,'new_assigned-student'] = final_data.loc[target_index,'Student'].astype(str)

#output:
   Teacher Class  Student  ...  Total Score_y  score ratio  new_assigned-student
0        P     A        1  ...             85     0.882353                   NaN
1        P     A        2  ...             85     0.117647                   NaN
2        N     A        3  ...             15     0.666667                   NaN
3        N     A        4  ...             15     0.333333                   NaN
4        N     B        1  ...             95     0.789474                   NaN
5        N     B        2  ...             95     0.210526                   NaN
6        P     B        3  ...              5     1.000000                   NaN
7        N     C        1  ...             84     0.714286                   NaN
8        N     C        2  ...             84     0.238095                   NaN
9        N     C        5  ...             84     0.047619                   2
10       P     C        3  ...             16     0.625000                   NaN
11       P     C        4  ...             16     0.375000                   NaN

This should work, just use shift. It assumes your scores are sorted within each teacher/class, as in your example

final_data['new_assigned_student'] = final_data.groupby(['Teacher','Class'])['Student'].shift()
final_data.loc[final_data['score ratio']>0.05,'new_assigned_student'] = np.nan

Result

    Teacher    Class      Student    Total Score_x    Percent    Total Score_y    score ratio    new_assigned_student
--  ---------  -------  ---------  ---------------  ---------  ---------------  -------------  ----------------------
 0  P          A                1               75         43               85       0.882353                     nan
 1  P          A                2               10         32               85       0.117647                     nan
 2  N          A                3               10         30               15       0.666667                     nan
 3  N          A                4                5         36               15       0.333333                     nan
 4  N          B                1               75         35               95       0.789474                     nan
 5  N          B                2               20         28               95       0.210526                     nan
 6  P          B                3                5         34                5       1                            nan
 7  N          C                1               60         33               84       0.714286                     nan
 8  N          C                2               20         31               84       0.238095                     nan
 9  N          C                5                4         29               84       0.047619                       2
10  P          C                3               10         36               16       0.625                        nan
11  P          C                4                6         37               16       0.375                        nan
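The shift approach can be tried on a small frame like the one below. The five rows are a reduced, hypothetical subset of the example data, and Series.where is used here in place of the .loc assignment; under the same sorting assumption it keeps the shifted value only where the score ratio is at most 0.05.

```python
import pandas as pd

# Reduced subset of the example data, already sorted within each Teacher/Class group.
final_data = pd.DataFrame({
    "Teacher": ["N", "N", "N", "P", "P"],
    "Class": ["C", "C", "C", "C", "C"],
    "Student": [1, 2, 5, 3, 4],
    "score ratio": [0.714286, 0.238095, 0.047619, 0.625, 0.375],
})

# Previous student within the same Teacher/Class group (NaN for the first row of a group).
prev_student = final_data.groupby(["Teacher", "Class"])["Student"].shift()

# Keep the shifted value only for rows with a low score ratio.
final_data["new_assigned_student"] = prev_student.where(final_data["score ratio"] <= 0.05)
print(final_data)
```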

Solution 2

Here is a more robust solution, though somewhat more involved

df3 = final_data
df_min_pct = (df3.groupby(['Teacher','Class'],as_index = False,sort = False)
                 .apply(lambda g: g.iloc[g.loc[g['score ratio']>0.05,'Percent'].argmin()])
)

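Here df_min_pct shows, for each Teacher/Class group, the details of the student with the lowest Percent among those whose score ratio is above 0.05: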

    Teacher    Class      Student    Total Score_x    Percent    Total Score_y    score ratio
--  ---------  -------  ---------  ---------------  ---------  ---------------  -------------
 0  P          A                2               10         32               85       0.117647
 1  N          A                3               10         30               15       0.666667
 2  N          B                2               20         28               95       0.210526
 3  P          B                3                5         34                5       1
 4  N          C                2               20         31               84       0.238095
 5  P          C                3               10         36               16       0.625

Now we merge with the original df and remove the details from the rows where they are not relevant

df4 = df3.merge(df_min_pct[['Teacher','Class','Student']],on = ['Teacher','Class'],sort = False).rename(columns = {'Student_y':'new_assigned_student'})
df4.loc[df4['score ratio']>0.05,'new_assigned_student'] = np.nan

This produces the desired result

    Teacher    Class      Student_x    Total Score_x    Percent    Total Score_y    score ratio    new_assigned_student
--  ---------  -------  -----------  ---------------  ---------  ---------------  -------------  ----------------------
 0  P          A                  1               75         43               85       0.882353                     nan
 1  P          A                  2               10         32               85       0.117647                     nan
 2  N          A                  3               10         30               15       0.666667                     nan
 3  N          A                  4                5         36               15       0.333333                     nan
 4  N          B                  1               75         35               95       0.789474                     nan
 5  N          B                  2               20         28               95       0.210526                     nan
 6  P          B                  3                5         34                5       1                            nan
 7  N          C                  1               60         33               84       0.714286                     nan
 8  N          C                  2               20         31               84       0.238095                     nan
 9  N          C                  5                4         29               84       0.047619                       2
10  P          C                  3               10         36               16       0.625                        nan
11  P          C                  4                6         37               16       0.375                        nan
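A variant of this second solution is sketched below with idxmin instead of the apply/argmin combination. This is a swapped-in technique, not the answer's own code: idxmin returns index labels, so the lookup does not depend on where the filtered-out rows sit inside each group. The data is a reduced, hypothetical subset of the example, and the Student_target column name is made up.

```python
import pandas as pd

df3 = pd.DataFrame({
    "Teacher": ["N", "N", "N", "P", "P"],
    "Class": ["C", "C", "C", "C", "C"],
    "Student": [1, 2, 5, 3, 4],
    "Percent": [33, 31, 29, 36, 37],
    "score ratio": [0.714286, 0.238095, 0.047619, 0.625, 0.375],
})

# Among rows above the threshold, find the label of the lowest Percent per group.
eligible = df3[df3["score ratio"] > 0.05]
min_idx = eligible.groupby(["Teacher", "Class"], sort=False)["Percent"].idxmin()
df_min_pct = df3.loc[min_idx, ["Teacher", "Class", "Student"]]

# Attach the target student to every row of its group, then blank out irrelevant rows.
df4 = df3.merge(df_min_pct, on=["Teacher", "Class"], suffixes=("", "_target"))
df4["new_assigned_student"] = df4["Student_target"].where(df4["score ratio"] <= 0.05)
df4 = df4.drop(columns="Student_target")
print(df4)
```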

Converting between pandas DataFrame and Spark DataFrame (how should the data types be converted?)

Article outline

Spark 2.x
Spark 3.2 and above
References

Spark 2.x

Spark 2.4.8:

https://spark.apache.org/docs/2.4.8/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.toPandas

Source code:

@since(1.3)
    def toPandas(self):
        """
        Returns the contents of this :class:`DataFrame

This article is also published on the blog "shiter" (CSDN).

Filtering the rows of a Pandas DataFrame using a column of another DataFrame

I would use merge

out = df1.merge(df2[['col1','col2']],on = 'col1',suffixes = ('','1')).query('col3>=col21').drop(columns = 'col21')

out
Out[15]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

Or reindex

out = df1[df1['col3'] >= df2.set_index('col1')['col2'].reindex(df1['col1']).values]
Out[19]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

You can also use map:

 df1.loc[df1.col3 >= df1.col1.map(df2.set_index("col1").col2)]
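The map variant can be checked on a small example. The values below are hypothetical, and df2 is assumed to hold exactly one threshold per col1 value, since map needs a unique index to look up.

```python
import pandas as pd

# Hypothetical data: df2 holds one threshold (col2) per group (col1).
df1 = pd.DataFrame({
    "col1": ["A", "A", "B", "B"],
    "col3": [0.70, 0.90, 0.65, 0.40],
})
df2 = pd.DataFrame({"col1": ["A", "B"], "col2": [0.80, 0.50]})

# Look up each row's threshold by its col1 value, aligned with df1's index.
threshold = df1["col1"].map(df2.set_index("col1")["col2"])

# Keep only the rows whose col3 reaches the threshold of their group.
out = df1.loc[df1["col3"] >= threshold]
print(out)
```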

My approach is similar to @Ben_Yo's merge answer; it takes more code but is perhaps more straightforward.

You simply:

1) Merge the column and create a new dataframe s.

2) Turn s into a boolean series of True or False based on the condition (in this case s['col3'] >= s['col2']).

3) Finally, pass s to df1; the result excludes the rows for which the boolean series s is False.
