I deployed a simple Spark application on Kubernetes with the following configuration:

spark.executor.instances=2
spark.executor.memory=8g
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.executor.cores=2
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=2
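
A spark-submit invocation carrying these properties would look roughly like this (the master URL, container image, and application jar are placeholders, and the app name is only inferred from the executor pod names shown below):

$ spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<port> \
    --deploy-mode cluster \
    --name spark-k8sdemo \
    --conf spark.executor.instances=2 \
    --conf spark.executor.memory=8g \
    --conf spark.executor.cores=2 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=2 \
    --conf spark.dynamicAllocation.maxExecutors=2 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/app.jar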

The memory requirements of the executor pods are more than what is available on the Kubernetes cluster, so the executor pods always stay in the Pending state, as shown below.

$ kubectl get all
NAME                                        READY   STATUS    RESTARTS   AGE
pod/spark-k8sdemo-6e66d576f655b1f5-exec-1   0/1     Pending   0          10m
pod/spark-k8sdemo-6e66d576f655b1f5-exec-2   0/1     Pending   0          10m
pod/spark-master-6d9bc767c6-qsk8c           1/1     Running   0          10m

I know the reason is the lack of available resources, as shown by the kubectl describe command:

$ kubectl describe pod/spark-k8sdemo-6e66d576f655b1f5-exec-1
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  28s (x12 over 12m)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.

Meanwhile, the driver pod keeps waiting forever for the executor pods to get enough resources, as shown below.

$ kubectl logs pod/spark-master-6d9bc767c6-qsk8c
21/01/12 11:36:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/01/12 11:37:01 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/01/12 11:37:16 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

My question: is there a way to make the driver wait only for a limited time (or number of retries) and, if the executors still don't get resources, have the driver pod terminate itself with a proper message in the logs, e.g. "application aborted as there were no resources in cluster"?

I went through the Spark configuration properties looking for something like this but couldn't find any. YARN has spark.yarn.maxAppAttempts, but I found nothing similar for Kubernetes.

If no such configuration is available in Spark, is there a way to achieve the same thing in the Kubernetes pod definition?



1 Answer

This is about Apache Spark 3.0.1. No idea whether things change in the upcoming 3.1.1.


tl;dr I don't think there's built-in support for "make the driver wait only for some time/retries and, if the executors still don't get resources, have the driver pod auto-die with a proper message in the logs".


My very basic understanding of Spark on Kubernetes lets me claim that there is no such feature to "auto-die" the driver pod when there are no resources for executor pods.

There are podCreationTimeout (derived from the spark.kubernetes.allocation.batch.delay configuration property) and the spark.kubernetes.executor.deleteOnTermination configuration property, which make Spark on Kubernetes delete executor pods that were requested but never created, but that's not really what you want.
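
For completeness, those are set like any other Spark property; the delay value below is arbitrary and only for illustration, and deleteOnTermination already defaults to true as far as I remember:

spark.kubernetes.allocation.batch.delay=10s
spark.kubernetes.executor.deleteOnTermination=true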

Dynamic Allocation of Executors could make things a bit more complex, but it does not really matter in this case.


A workaround could be to use spark-submit --status to request the status of the Spark application, check whether it's up and running, and --kill it after a certain time threshold (you could achieve a similar thing using kubectl directly, too); see the sketch below.
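
A rough sketch of the kubectl variant of that workaround. It assumes the driver pod name from the question and the spark-role=executor label that Spark on Kubernetes adds to executor pods; the 10-minute threshold is arbitrary:

# give the executors 10 minutes to leave the Pending state
sleep 600
# count executor pods that are still Pending
PENDING=$(kubectl get pods -l spark-role=executor \
  --field-selector=status.phase=Pending --no-headers 2>/dev/null | wc -l)
if [ "$PENDING" -gt 0 ]; then
  echo "application aborted as there were no resources in cluster"
  kubectl delete pod spark-master-6d9bc767c6-qsk8c
fi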


Just an FYI and to make things a bit more interesting: you should also review two other Spark on Kubernetes-specific configuration properties:

  • spark.kubernetes.driver.request.cores
  • spark.kubernetes.executor.request.cores

There could be others.
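
For example (the values are purely illustrative), these let the pods request fewer CPUs than spark.executor.cores would otherwise imply:

spark.kubernetes.driver.request.cores=0.5
spark.kubernetes.executor.request.cores=0.5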

