Spark Source Code Analysis
2024-12-04 10:48:47
Spark framework principles
Outline / Contents
shuffle-reader
client JVM
CoarseGrainedSchedulerBackend
numUsable: 3
narrow dependency
getMissingParentStages
ask:RequestSubmitDriver(driverDescription)
threadPool
DAGScheduler
onStart()
flatMap
task thread
StandaloneAppClient
launchTask() → task.run
filter
spark-submit (driver)
submitJob
spark-submit
receive / receiveAndReply
If you are a coder with a batch of data to process, you can either iterate over it or recurse over it. A Java coder should weigh the cost of recursion against the cost of iteration [recursion easily overflows the stack]. This is part of what separates programmer levels: junior, mid-level, senior.
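The trade-off above can be sketched directly. This is a minimal illustration (not from the Spark source): the recursive traversal adds one stack frame per element and eventually overflows, while the iterative version reuses a single frame.

```python
import sys

def sum_recursive(values, i=0):
    # Each call adds a stack frame; deep enough data blows the stack.
    if i == len(values):
        return 0
    return values[i] + sum_recursive(values, i + 1)

def sum_iterative(values):
    # A loop reuses one frame: constant stack cost.
    total = 0
    for v in values:
        total += v
    return total

small = list(range(100))
assert sum_recursive(small) == sum_iterative(small) == 4950

# Recursion fails where iteration succeeds:
big = list(range(sys.getrecursionlimit() * 2))
try:
    sum_recursive(big)
    overflowed = False
except RecursionError:
    overflowed = True
assert overflowed and sum_iterative(big) == sum(big)
```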
task.run
RDD
CoarseGrainedExecutorBackend
onStart
not granted
master JVM
instance / entity
worker
1
Impressive: Spark application code reads like linear, single-machine code. Inside the code logic, only new RDD objects are created; no computation actually happens. The RDDs carry dependency relationships between them, building up a tree (a singly linked chain). Finally, the action operator on the last RDD triggers the DAGScheduler to recurse back through the chain stage by stage, traversing the RDDs via a stack structure and finding shuffle dependencies to cut stage boundaries, which yields the correct stages. The return phase of that recursion is exactly the front-to-back order in which the stages are submitted.
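The lazy-building behavior described above can be sketched with a toy chain (an assumption-laden simplification, not Spark's actual API): transformations only allocate a linked node that records its parent and function; only the action walks the chain and computes.

```python
# Toy sketch of lazy RDD chaining: map() builds a node, collect() computes.
class ToyRDD:
    def __init__(self, parent=None, func=None, data=None):
        self.parent, self.func, self.data = parent, func, data

    def map(self, f):
        # A transformation: creates a new node, performs no computation.
        return ToyRDD(parent=self, func=f)

    def collect(self):
        # The action: walks back up the chain of parents and computes.
        if self.parent is None:
            return list(self.data)
        return [self.func(x) for x in self.parent.collect()]

rdd = ToyRDD(data=[1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
# Nothing has run yet; only the action triggers computation:
assert rdd.collect() == [11, 21, 31]
```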
inbox
eventThread.run
sparkEnv
groupByKey
foreach
Resource allocation prefers spreading executors horizontally across workers
textFile()
postMessage
Spark compute-layer roles: client --> driver --> executor
spreadOutApps : true
memoryPerExecutor 4G
queue
partitionsToCompute
rpcEnv: 基础设施
connection pool, thread pool, number of connections, number of threads
2
With Spark RDD you no longer need tools like IBM DataStage / DB2 ETL
IO: 10,000,000 sockets
sbin/start-all.sh
sparksubmit
3
female
registerRpcEndpoint
Worker.onStart()
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
rootPool
master
executors
9
shuffleMapStage
5C7G
startServer
oneExecutorPerWorker
resourceOffers
HadoopRDD → RecordReader
Spark executor resource requests: horizontal (spread-out) allocation by default
reduceByKey
reviveOffers
the message types differ
missing: HashSet
usableWorkers
4
Executor
DAGSchedulerEventProcessLoop
threadpool
2,LaunchDriver
def visit
app.desc.coresPerExecutor
send / ask
org.apache.spark.executor.CoarseGrainedExecutorBackend
coresPerExecutor : 3
submitMissingTasks
get / update / delete / post / put
d:RDD1:p
runTask
worker JVM
union
g
male
MessageLoop
7,RegisteredExecutor
submitTasks
LaunchExecutor
driverEndpoint
doOnReceive
the Feynman learning technique
e:RDD1:p
RegisterApplication
org.apache.spark.deploy.worker.DriverWrapper
receive / receiveAndReply
join
Master
DriverEndpoint
8,new
map
LaunchDriver
start
5,LaunchExecutor
RegisteredExecutor
cluster
post
HadoopRDD
jvm
minCoresPerExecutor 3C
6
In MapReduce, task classes are instantiated via reflection! In Spark: via deserialization!
A Thread is only a container for a thread of execution; it carries no business logic itself. Ultimately it pushes some object's run() method onto the stack as a method frame. The question to ask about that run() method: is it an infinite loop?!!
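The "is run() an infinite loop" point can be sketched with a minimal event-loop thread (an illustrative assumption, not Spark's actual EventLoop code): the thread's run body loops forever, draining a queue, until a poison pill ends it.

```python
import queue
import threading

events = queue.Queue()
handled = []

def run():
    # The "infinite" loop inside run(): the thread is just a container,
    # the loop body is where the actual message handling happens.
    while True:
        event = events.get()   # blocks until a message arrives
        if event is None:      # poison pill ends the loop
            break
        handled.append(event)

t = threading.Thread(target=run)
t.start()
for e in ["JobSubmitted", "TaskCompleted"]:
    events.put(e)
events.put(None)
t.join()
assert handled == ["JobSubmitted", "TaskCompleted"]
```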
$@: spark-submit --master spark://sdfsdf:7077 --deploy-mode client/cluster --class ooxx.class xxoo.jar --executor-cores/memory
method frame
StandaloneSchedulerBackend
REST style
ClientApp
ExecutorLaunch(appMaster)
TaskSchedulerImpl
ShuffleMapTask
Terminology. Core idea: the data expects distributed computation!
Correlated computation: requires a shuffle.
Uncorrelated computation: composes a stage and can finish on one machine.
Pipeline: the iterator-nesting pattern over narrow dependencies; between stage and stage a shuffle is required.
Task: the number of tasks in a stage is determined by the partition count of the last RDD in that stage.
ResultStage: a job can have N stages; a stage is simple, it just references one (final) RDD.
ShuffleMapStage: a job can have multiple stages; except for the last stage, all the others are ShuffleMapStages.
Tasks: ShuffleMapTask / ResultTask.
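The "tasks per stage = partitions of the stage's last RDD" rule can be modeled in a few lines (a simplified sketch under assumptions, not Spark's scheduler): one pipelined task runs per partition, and narrow operations chain inside it.

```python
# The stage's last RDD has 3 partitions, so the stage launches 3 tasks.
partitions = [[1, 2], [3, 4], [5, 6]]

def task(partition):
    # Within a stage, narrow ops pipeline per record (filter then map here),
    # with no materialization between them.
    return [x * 10 for x in partition if x % 2 == 0]

results = [task(p) for p in partitions]   # one task per partition
assert len(results) == len(partitions) == 3
assert results == [[20], [40], [60]]
```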
b
RPC
a:RDD3:p
3,RegisterApplication
Dispatcher
Compute layer: cluster mode
sparkContext
client jvm
ShuffleMapTask / ResultTask
sparkShell
1,ask:RequestSubmitDriver(driverDescription)
runJob
SparkEnv
b: msb → partitioner % 3 = 2. x's partition 2 contains msb; do the other partitions have it? No. From x to g, is another shuffle needed? If it were: x: msb → partitioner % 3 = 2, the same partition, so no shuffle is needed.
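The co-partitioning argument above can be demonstrated concretely. This is a toy sketch (the stable hash function is an assumption, since Python's built-in hash() is salted across runs): when two datasets use the same partitioner and partition count, matching keys are already co-located, so each partition pair can be combined locally with no shuffle.

```python
def partition_data(pairs, num_partitions, hash_fn):
    # Hash-partition (key, value) pairs, like a hash partitioner with % n.
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash_fn(k) % num_partitions].append((k, v))
    return parts

h = lambda k: sum(ord(c) for c in k)   # stable toy hash (assumption)

left = partition_data([("msb", 1), ("ab", 2)], 3, h)
right = partition_data([("msb", 9), ("ab", 8)], 3, h)

# Same partitioner + same partition count => join partition-by-partition,
# locally, without any shuffle of records across partitions.
joined = []
for lp, rp in zip(left, right):
    for k1, v1 in lp:
        for k2, v2 in rp:
            if k1 == k2:
                joined.append((k1, (v1, v2)))

assert sorted(joined) == [("ab", (2, 8)), ("msb", (1, 9))]
```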
g:RDD3:p
taskBinary
ref
The main method of our code, the driver, runs on the client side
endpoint
Shuffle is a very complex subsystem; its purpose is to make the computation tunable
executor
client
hadoopRDD
org.apache.spark.deploy.master.Master
send
ClientEndpoint
start-master.sh
ShuffleManager
MapPartitionsRDD
name
rpcEnv
6C8G
x
RpcEndpoint: receive / receiveAndReply
initialize
6,RegisterExecutor
submitStage
ShuffledRDD → reader
RegisterExecutor
stage
Master JVM
coresToAssign : 9
dispatch
granted
eventProcessLoop
yarn
x:RDD3:p
eventQueue
coresPerExecutor : ()
waitingForVisit: ArrayStack
receivers: LinkedBlockingQueue
receivers
process
b:RDD3:p
shuffle
EndpointData
taskBinaryBytes
ClientEndpoint.onStart
RpcEndpointRef: send / ask
f:RDD2:p
0
Driver (SparkContext) JVM
NettyRpcEnv
sc
9C12G
transport layer
deployMode
start-slaves.sh
f
handleJobSubmitted
RDD iterator-nesting computation
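The iterator-nesting idea can be sketched with generators (a simplified model under assumptions, not Spark's compute() code): each operator wraps its parent's iterator, so a single record flows through the whole narrow-dependency chain without materializing intermediate collections.

```python
def source():
    # The innermost iterator, standing in for e.g. a HadoopRDD record reader.
    for x in [1, 2, 3, 4]:
        yield x

def filter_iter(parent, pred):
    # Wraps the parent iterator: pulls one record at a time.
    for x in parent:
        if pred(x):
            yield x

def map_iter(parent, f):
    # Another wrapper layer; nesting these forms the pipeline.
    for x in parent:
        yield f(x)

pipeline = map_iter(filter_iter(source(), lambda x: x % 2 == 0),
                    lambda x: x * 10)
assert list(pipeline) == [20, 40]
```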
Download the Spark 2.3.4 source and import it into IDEA as a Maven project: https://github.com/apache/spark/tree/v2.3.4 — 1. resource layer, 2. compute layer; both are distributed!
visited
netty
stack structure
Worker JVM
shuffle-writer
ApplicationMaster (launches executors / driver)
schedulableBuilder
Serialization also serializes every object referenced by the serialized object's fields
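This object-graph rule is easy to verify directly. A minimal sketch using Python's pickle (the `Node` class is a hypothetical illustration, not from Spark): serializing one object carries along every object its fields reference.

```python
import pickle

class Node:
    def __init__(self, name, ref=None):
        self.name, self.ref = name, ref

inner = Node("inner")
outer = Node("outer", ref=inner)

# Serializing `outer` also serializes the `inner` object it references.
restored = pickle.loads(pickle.dumps(outer))
assert restored.ref.name == "inner"
assert restored.ref is not inner   # the reference travels as a copy
```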
EventLoop
org.apache.spark.deploy.worker.Worker
assignedCores
Where do Spark tasks come from, and how are they distributed to executors?
transport service: TransportServer
launchTask
app: --executor-cores 3 --total-executor-cores 9 --executor-memory 4g
c:RDD1:p
SparkSubmit
assignedExecutors
join
app.desc.memoryPerExecutorMB
taskIdToLocations
DriverWrapper JVM
endpoints
isStandaloneCluster
finalRDD
endpointRefs
run() -> infinite loop