Pyrallel - Parallel Data Analytics in Python

Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.


  • focus on small to medium datasets that fit in memory on a small (10+ node) to medium (100+ node) cluster.

  • focus on small to medium data (with data locality when possible).

  • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.

  • do not focus on HA / Fault Tolerance (yet).

  • do not try to invent a new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and the messages transferred, and to help identify the practical underlying constraints of a distributed machine learning setting.

Disclaimer: the public API of this library will probably not be stable any time soon, as the current goal of this project is to experiment.


Dependencies

The usual suspects: Python 2.7, NumPy, SciPy.

Fetch the development version (master branch) from:

The StarCluster develop branch and its IPCluster plugin are also required to easily start up a bunch of nodes with IPython.parallel set up.

Patterns currently under investigation

  • Asynchronous & randomized hyper-parameters search (a.k.a. Randomized Grid Search) for machine learning models

  • Efficient sharing of numerical arrays across nodes, making them available to concurrently running Python processes without in-memory copies, using memory-mapped files.

  • Distributed Random Forests fitting.

  • Ensembling heterogeneous library models.

  • Parallel implementation of online averaged models using an MPI AllReduce, for instance using MiniBatchKMeans on partitioned data.
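The randomized hyper-parameter search pattern can be sketched sequentially as follows. The parameter space and the `score` function below are hypothetical stand-ins: in the real pattern each sampled candidate would be evaluated asynchronously (e.g. by a cross-validated model fit dispatched to an IPython.parallel engine), and results would be collected as they complete.

```python
import random

# Hypothetical search space for a tree ensemble model.
param_space = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 5, None],
}

def sample_params(space, rng):
    # Draw one random candidate from the space (a.k.a. randomized search).
    return {name: rng.choice(values) for name, values in space.items()}

def score(params):
    # Stand-in for a cross-validated evaluation of the fitted model.
    depth = params["max_depth"] or 10
    return params["n_estimators"] / 100.0 - abs(depth - 5) * 0.01

rng = random.Random(42)
# In the asynchronous version, candidates are submitted to cluster engines
# and scored concurrently instead of in this sequential loop.
candidates = [sample_params(param_space, rng) for _ in range(20)]
best = max(candidates, key=score)
```

Because candidates are independent, this search is embarrassingly parallel: the loop maps directly onto a load-balanced view over the cluster engines.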
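The memory-mapped array sharing pattern can be sketched with NumPy's `memmap`: the array is dumped to a file once, and each worker process on the same node maps that file read-only, so the operating system's page cache backs all the processes with a single physical copy. The file path below is a throwaway temporary location for illustration.

```python
import os
import tempfile
import numpy as np

# Dump an array to a memory-mapped file once, on each node.
data = np.arange(12, dtype=np.float64).reshape(3, 4)
path = os.path.join(tempfile.mkdtemp(), "shared_data.mmap")
mm = np.memmap(path, dtype=data.dtype, mode="w+", shape=data.shape)
mm[:] = data
mm.flush()

# In each worker process (simulated here in the same process): map the
# same file read-only; the data is not duplicated per process in RAM.
worker_view = np.memmap(path, dtype=np.float64, mode="r", shape=(3, 4))
total = float(worker_view.sum())
```

The `mode="r"` mapping is the key design choice: read-only views are safe to share between concurrent fitting processes without locking.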

See the content of the examples/ folder for more details.
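The averaging step of the AllReduce pattern for online averaged models can be sketched as follows. Each node fits a model (e.g. MiniBatchKMeans) on its own data partition, then the model parameters are averaged so that every node ends up with the same combined model. The in-process mean below is a stand-in for the collective communication step; a real implementation would use an MPI AllReduce (e.g. via mpi4py) across the nodes.

```python
import numpy as np

def allreduce_average(local_params):
    # Stand-in for an MPI AllReduce with a sum operation followed by a
    # division by the number of participating nodes.
    return np.stack(local_params).mean(axis=0)

# Hypothetical per-node cluster centers (2 centers in 2D on each of 2 nodes),
# as produced by fitting MiniBatchKMeans on each node's data partition.
node_params = [
    np.array([[0.0, 0.0], [2.0, 2.0]]),
    np.array([[1.0, 1.0], [3.0, 3.0]]),
]
averaged = allreduce_average(node_params)
```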




This project started at the PyCon 2012 PyData sprint as a set of proof of concept IPython.parallel scripts.
