This article gives you a detailed look at grouping with NumPy and the performance of itertools.groupby, and explains how Python's itertools.groupby is used. It also covers: the difference between groupby.first, groupby.nth and groupby.head when as_index = False; groupBy and addGroupBy not grouping; using GroupBy and a LINQ left join; and notes on itertools.groupby()/itertools.compress().
Contents:
- Grouping with NumPy vs. itertools.groupby performance (how to use Python's itertools.groupby)
- What is the difference between groupby.first, groupby.nth and groupby.head when as_index = False?
- groupBy and addGroupBy do not group
- Using GroupBy and a LINQ left join
- Notes on itertools.groupby()/itertools.compress()
Grouping with NumPy vs. itertools.groupby performance (how to use Python's itertools.groupby)
I have many large (> 35,000,000 element) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but it seems slow. Can anyone do better, using Python and preferably NumPy?
def group():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    values.sort()
    groups = ((k, len(list(g))) for k, g in groupby(values))
    index = np.fromiter(groups, dtype='u4,u2')

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("group()", "from __main__ import group")
    print(t.timeit(number=1))
It returns:
$ python bench.py
111.377498865
Cheers!
Edit (based on the replies):
def group_original():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    values.sort()
    groups = ((k, len(list(g))) for k, g in groupby(values))
    index = np.fromiter(groups, dtype='u4,u2')

def group_gnibbler():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    values.sort()
    groups = ((k, sum(1 for i in g)) for k, g in groupby(values))
    index = np.fromiter(groups, dtype='u4,u2')

def group_christophe():
    import numpy as np
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    values.sort()
    counts = values.searchsorted(values, side='right') - values.searchsorted(values, side='left')
    index = np.zeros(len(values), dtype='u4,u2')
    index['f0'] = values
    index['f1'] = counts
    # Erroneous result!

def group_paul():
    import numpy as np
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    values.sort()
    diff = np.concatenate(([1], np.diff(values)))
    idx = np.concatenate((np.where(diff)[0], [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)

if __name__ == '__main__':
    from timeit import Timer
    timings = [
        ("group_original", "Original"),
        ("group_gnibbler", "Gnibbler"),
        ("group_christophe", "Christophe"),
        ("group_paul", "Paul"),
    ]
    for method, title in timings:
        t = Timer("%s()" % method, "from __main__ import %s" % method)
        print("%s: %s secs" % (title, t.timeit(number=1)))
It returns:
$ python bench.py
Original: 113.385262966 secs
Gnibbler: 71.7464978695 secs
Christophe: 27.1690568924 secs
Paul: 9.06268405914 secs
Christophe's version currently gives an incorrect result, though: it builds one (value, count) entry per element of the sorted array instead of one entry per distinct value.
Answer 1
I got roughly a 3x improvement by doing something like this:
def group():
    import numpy as np
    values = np.array(np.random.randint(0, 3298, size=35000000), dtype='u4')
    values.sort()
    dif = np.ones(values.shape, values.dtype)
    dif[1:] = np.diff(values)
    idx = np.where(dif > 0)
    vals = values[idx]
    count = np.diff(idx)
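As an aside, on NumPy 1.9 or later the same (value, count) pairs can be obtained in a single call with np.unique(..., return_counts=True), which also sorts internally. A minimal sketch, with the array size reduced here purely for illustration:

import numpy as np

# Smaller stand-in for the 35,000,000-element array from the question
values = np.random.randint(0, 1 << 16, size=1000000).astype('u4')

# Each distinct value together with how many times it occurs --
# the same (key, count) pairs the diff/where code above builds by hand
keys, counts = np.unique(values, return_counts=True)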
What is the difference between groupby.first, groupby.nth and groupby.head when as_index = False?
Edit: as @coldspeed, @wen-ben and @ALollz pointed out, I made a rookie mistake by using the string 'np.nan' instead of an actual NaN. The answers are very good, so I am not deleting this question, in order to keep those answers.
Original question:
I have already read this question/answer: what is the difference between groupby.first() and groupby.head(1)?
That answer says the difference lies in how NaN values are handled. However, when I call groupby with as_index=False, both of them pick up the NaN.
Besides, pandas has groupby.nth, which looks similar to head and first. What are the differences between groupby.first(), groupby.nth(0) and groupby.head(1) with as_index=False?
Example below:
In [448]: df
Out[448]:
   A       B
0  1  np.nan
1  1       4
2  1      14
3  2       8
4  2      19
5  2      12

In [449]: df.groupby('A', as_index=False).head(1)
Out[449]:
   A       B
0  1  np.nan
3  2       8

In [450]: df.groupby('A', as_index=False).first()
Out[450]:
   A       B
0  1  np.nan
1  2       8

In [451]: df.groupby('A', as_index=False).nth(0)
Out[451]:
   A       B
0  1  np.nan
3  2       8
I can see that first() resets the index while the other two do not. Apart from that, is there any other difference?
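To see the NaN handling that the linked answer describes, here is a minimal sketch with a real NaN in place of the string (the exact output layout depends on the pandas version):

import numpy as np
import pandas as pd

# Same frame as above, but with an actual NaN rather than the string 'np.nan'
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [np.nan, 4, 14, 8, 19, 12]})

# first() returns the first non-missing B per group: 4.0 for group 1, 8.0 for group 2
print(df.groupby('A', as_index=False).first())

# head(1) and nth(0) are positional: they keep each group's first row, NaN and all
print(df.groupby('A', as_index=False).head(1))
print(df.groupby('A', as_index=False).nth(0))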
groupBy and addGroupBy do not group
The two group-by attributes mentioned in the query, sales.deleted_at and sales.id, make every row of the result set unique. I would rather write two different queries with a where clause to fetch the deleted and the non-deleted records.
Update 1
I verified that the ORM is returning the expected response for the query above. In the response, I get the non-deleted records first and then the deleted ones.
Using GroupBy and a LINQ left join
The simplest grouping
var conHistoryList = conHistoryData.GroupBy(g => g.personId);
Here conHistoryData is of type IQueryable<T>.
Sorting within each group after grouping
var conHistoryList = conHistoryData.GroupBy(g => g.personId).Select(g => g.OrderBy(c => c.addTime));
After grouping the data, each group is then sorted by one of its fields.
Sorting each group and returning its first element
var conHistoryList = conHistoryData.GroupBy(g => g.personId).Select(g => g.OrderBy(c => c.addTime).FirstOrDefault()).ToList();
After grouping the data, each group is sorted by one of its fields; the first element of each group is then taken and the results are collected into a list. In other words, this returns a new list containing the first element of every group.
Using the elements inside a group
var dd = conHistoryData.GroupBy(g => g.personId);
// Get the key of each group
foreach (var item in dd)
{
    string GroupByKey = item.Key;
    // Iterate over the contents of each group
    foreach (var item2 in item)
    {
        ContractHistoryInfor historyInfor = item2;
    }
}
Combining multiple levels of grouping
var childDetailListData = new List<ChildFactDetailInfor>();
var childDetailListG = childDetailList.GroupBy(b => b.materialCategory).Select(g => g.GroupBy(b => b.materialNum));
childDetailList is a List<ChildFactDetailInfor>. The data is first grouped by the materialCategory field, and then grouped again by the materialNum field within each group.
Using the elements of the nested grouping
var childDetailListData = new List<ChildFactDetailInfor>();
var childDetailListG = childDetailList.GroupBy(b => b.materialCategory).Select(g => g.GroupBy(b => b.materialNum));
foreach (var item in childDetailListG)
{
    foreach (var item2 in item)
    {
        decimal? a = item2.Sum(x => x.materialNum);
        decimal? c = item2.Sum(x => x.priceTotal);
        foreach (var item3 in item2)
        {
            item3.materialNum = a;
            item3.priceTotal = c;
            childDetailListData.Add(item3);
        }
    }
}
This is much the same as using the elements of a single grouping above; the only difference is how many levels of loops you have to traverse.
As a supplementary note, here is a LINQ example of a left join across multiple tables that handles missing (null) rows:


var data = (from a in childData
join bj in bjCostData
on a.id equals bj.childId
into bj
from bje in bj.DefaultIfEmpty()
join bjt in bjRefundData
on a.id equals bjt.childId
into bjt
from bjte in bjt.DefaultIfEmpty()
//Student safety insurance (xpx)
join xpx in xpxCostData
on a.id equals xpx.childId
into xpx
from xpxe in xpx.DefaultIfEmpty()
join xpxt in xpxRefundData
on a.id equals xpxt.childId
into xpxt
from xpxte in xpxt.DefaultIfEmpty()
//Meal fees (cf)
join cf in cfCostData
on a.id equals cf.childId
into cf
from cfe in cf.DefaultIfEmpty()
join cft in cfRefundData
on a.id equals cft.childId
into cft
from cfte in cft.DefaultIfEmpty()
//Custom charges (zdy)
join zdy in zdyCostData
on a.id equals zdy.childId
into zdy
from zdye in zdy.DefaultIfEmpty()
join zdyt in zdyRefundData
on a.id equals zdyt.childId
into zdyt
from zdyte in zdyt.DefaultIfEmpty()
//Leave of absence (xy)
join xy in xyCostData
on a.id equals xy.childId
into xy
from xye in xy.DefaultIfEmpty()
join xyt in xyRefundData
on a.id equals xyt.childId
into xyt
from xyte in xyt.DefaultIfEmpty()
//Deposit (yj)
join yj in yjCostData
on a.id equals yj.childId
into yj
from yje in yj.DefaultIfEmpty()
join yjt in yjRefundData
on a.id equals yjt.childId
into yjt
from yjte in yjt.DefaultIfEmpty()
select new H_ChildStatistics
{
id = a.id,
parkId = a.parkId,
parkName = a.parkName,
childName = a.childName,
childNameEng = a.childNameEng,
gradeNo = a.gradeNo,
classNo = a.classNo,
modifyTime = a.modifyTime,
bjfTotalReceive = bje == null ? 0 : bje.payTotalMoney,
bjfTotalRefund = bjte == null ? 0 : bjte.payTotalMoney,
xpxTotalReceive = xpxe == null ? 0 : xpxe.payTotalMoney,
xpxTotalRefund = xpxte == null ? 0 : xpxte.payTotalMoney,
cfTotalReceive = cfe == null ? 0 : cfe.payTotalMoney,
cfTotalRefund = cfte == null ? 0 : cfte.payTotalMoney,
xyglfTotalReceive = xye == null ? 0 : xye.payTotalMoney,
xyglfTotalRefund = xyte == null ? 0 : xyte.payTotalMoney,
yjTotalReceive = yje == null ? 0 : yje.payTotalMoney,
yjTotalRefund = yjte == null ? 0 : yjte.payTotalMoney,
zdyTotalReceive = zdye == null ? 0 : zdye.payTotalMoney,
zdyTotalRefund = zdyte == null ? 0 : zdyte.payTotalMoney,
childTotalReceive = ((bje == null ? 0 : bje.payTotalMoney) + (xpxe == null ? 0 : xpxe.payTotalMoney) + (cfe == null ? 0 : cfe.payTotalMoney) + (xye == null ? 0 : xye.payTotalMoney) + (yje == null ? 0 : yje.payTotalMoney) + (zdye == null ? 0 : zdye.payTotalMoney)),
childTotalRefund = ((bjte == null ? 0 : bjte.payTotalMoney) + (xpxte == null ? 0 : xpxte.payTotalMoney) + (cfte == null ? 0 : cfte.payTotalMoney) + (xyte == null ? 0 : xyte.payTotalMoney) + (yjte == null ? 0 : yjte.payTotalMoney) + (zdyte == null ? 0 : zdyte.payTotalMoney)),
});
Notes on itertools.groupby()/itertools.compress()
About itertools.groupby()
itertools.groupby() puts adjacent, equal keys into the same group. For an equivalent pure-Python implementation, see the groupby class written out at https://docs.python.org/3/library/itertools.html?highlight=groupby#itertools.groupby
>>> from itertools import groupby
>>> list_a
['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'A', 'A', 'B', 'B', 'B']
>>> for date, items in groupby(list_a):
... print('date: {}'.format(date))
... for item in items:
... print(item, end=" ")
... print("\n==========")
...
date: A
A A A A
==========
date: B
B B B
==========
date: C
C C
==========
date: D
D
==========
date: A
A A
==========
date: B
B B B
==========
Notice that the example above can still be improved: all the A's should really end up in a single group (the split happens because some A's are not adjacent). Sorting first fixes this:
>>> list_a
['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'A', 'A', 'B', 'B', 'B']
>>> list_a.sort(key=lambda item: item)  # after sorting with the lambda key, equal elements become adjacent
>>> list_a
['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'D']
>>> for date, items in groupby(list_a):
... print('date: {}'.format(date))
... for item in items:
... print(item, end=" ")
... print("\n==========")
...
date: A
A A A A A A
==========
date: B
B B B B B B
==========
date: C
C C
==========
date: D
D
==========
Besides an anonymous lambda, you can also use operator.itemgetter() as the key function; it is somewhat faster than a lambda. See the Python Cookbook for details.
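A minimal sketch of the itemgetter-based key, in the spirit of the Cookbook recipe (the records below are made up for illustration):

from itertools import groupby
from operator import itemgetter

rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5148 N CLARK', 'date': '07/04/2012'},
    {'address': '5800 E 58TH', 'date': '07/02/2012'},
    {'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'},
]

# Sort first so that equal dates become adjacent, then group on the same key
rows.sort(key=itemgetter('date'))
for date, items in groupby(rows, key=itemgetter('date')):
    print(date)
    for item in items:
        print('   ', item)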
About itertools.compress(data, selectors)
It decides, based on the selectors you pass in, whether each item of the data is kept.
>>> list1 = [1, 4, 7, 2, 98, 3, 6, 2]
>>> list_TF = [0,1,0,1,1,1,0,0]
>>> list_TF = [n == 1 for n in list_TF]
>>> list_TF
[False, True, False, True, True, True, False, False]
>>> from itertools import compress
>>> list(compress(list1, list_TF))
[4, 2, 98, 3]
In fact, the docs show that compress() is roughly equivalent to the following:
>>> list1
[1, 4, 7, 2, 98, 3, 6, 2]
>>> list_TF
[False, True, False, True, True, True, False, False]
>>> [n for n,s in zip(list1, list_TF) if s]
[4, 2, 98, 3]
If that feels slow, you can use a generator instead, as sketched below.
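A minimal sketch, reusing list1 and list_TF from the session above:

>>> lazy = (n for n, s in zip(list1, list_TF) if s)
>>> list(lazy)
[4, 2, 98, 3]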
That concludes today's notes on grouping with NumPy vs. itertools.groupby performance and on using Python's itertools.groupby. Thanks for reading; for more on the related topics listed above, search this site.