First oddity: is there a reason Godot wouldn't use at least 100% of one CPU core when I'm thrashing CPU-based functions to the point that script lag drops my FPS below 45? In fact, it tends to use ~13% total, but never fully utilises a single core.
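One possible reading of the ~13% figure, offered as a guess since the core count isn't stated: on a machine with 8 logical cores, a single fully busy thread shows up as roughly 12.5% "total" CPU, and the OS is free to move that thread between cores, so no individual core has to look pegged. A minimal sketch to check what one saturated core would look like on a given machine (uses OS.get_processor_count() from Godot 3.x):

```gdscript
extends Node

func _ready():
	# Logical core count as reported by the OS.
	var cores = OS.get_processor_count()
	# Share of the "total CPU" figure that one fully busy core contributes.
	var one_core_percent = 100.0 / cores
	print("One saturated core shows as ~%.1f%% total on %d logical cores" % [one_core_percent, cores])
```

If the printed value is close to the total you see in Task Manager, the main thread is probably already using a full core's worth of time.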

Second, a possibly obvious misunderstanding of Threads in Godot: do these actually run in parallel at all when started from _process() or _physics_process()? I have been testing this by breaking up a function that iterates over an array to perform calculations and normally takes 10ms. I split the array into 4 batches across 4 threads and, for the purpose of testing, made 4 versions of the function (so they would show separately in the profiler). As far as I can tell, it has exactly the same performance as running them sequentially. Even in the profiler, each function shows up individually as taking almost exactly 2.5ms (though I'm not convinced the profiler accounts for thread times in this scenario).

I also tried calling thread_n.wait_to_finish() across multiple frames.

I tried toggling from single-safe to multi-threaded in project settings and this too seems to have no impact. I am just a little confused as to how I would go about optimising this task. I know in principle the above should work as I do similar in Python on the reg.

I don't know much about the project this is for, but I know Godot has options in the project settings for multi-threading: multi-threading for 2D physics, and multi-threading for rendering. It uses the Bullet physics engine or the Godot physics engine. It looks like Bullet or Godot physics is single-threaded. One problem I have discovered with Godot when running a thread: the Godot process is limited to 60 FPS, while a thread is unlimited. Because Godot limits both physics and rendering to 60 FPS, this might be the reason you're not seeing full utilization.

Yeh, the single/multi-threaded toggle does nothing as far as I can tell for my 3D project. I've also seen a few git issues raised about Godot's Bullet implementation not being multi-threaded (yet), so not surprising, but I didn't consider this could be a 2D-only thing.

However my scenario does not relate to physics as such, but rather creating my own threads for my own function that I want to call every frame. I did try setting the physics frame rate to 30 and 120 just to see if it had any impact and again there was minimal perceivable difference apart from slightly choppier movement/animations at 30. Also if the 60fps cap was in some way causing utilisation to not cap, I would only expect that to be true if I was exceeding 60fps, which I'm not.

Essentially what I am seeing is 30-45fps with 30% GPU usage, as the CPU function calls I have written are taking up the bulk of the processing time (as shown in the profiler). However, despite the CPU being the apparent bottleneck, at no point is any core on the CPU at or even close to 100%. AND when I manually generate 4 new threads to parallelise iteration and processing of a large list, it SEEMS to still be completing the tasks sequentially. I suspect some aspect of the _physics_process or _process functions prevents threading from working as I expect it to, but it seems undocumented. The threading examples in the docs do not cover usage in either of them, so perhaps this is by design?

What I would expect to see from threading for the 4 functions I created is that they would take 2.5ms each, but near simultaneously. The total would then be roughly the time of the slowest thread plus overhead, which should mean a saving of close to 7.5ms per frame. Given the ~16ms frame budget, that is significant. Idk, maybe the lack of a GIL is throwing me off kilter and I've just missed something.
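One way to check this expectation without relying on the profiler is to measure wall-clock time around the whole start/join block with OS.get_ticks_usec(). The sketch below is only an illustration: `_busy_work` and the iteration count are hypothetical stand-ins for the real per-batch calculation, and results in a debug build run from the editor may differ from an exported release build.

```gdscript
extends Node

# Hypothetical stand-in for the expensive per-batch work.
func _busy_work(iterations):
	var acc = 0.0
	for i in range(iterations):
		acc += sqrt(i)
	return acc

func _ready():
	var iterations = 200000

	# Sequential baseline: four batches, one after another.
	var t0 = OS.get_ticks_usec()
	for n in range(4):
		_busy_work(iterations)
	var sequential_usec = OS.get_ticks_usec() - t0

	# Threaded: start all four workers, then join all four.
	var threads = []
	t0 = OS.get_ticks_usec()
	for k in range(4):
		var worker = Thread.new()
		worker.start(self, "_busy_work", iterations)
		threads.append(worker)
	for started in threads:
		started.wait_to_finish()
	var threaded_usec = OS.get_ticks_usec() - t0

	print("sequential: %d us, threaded: %d us" % [sequential_usec, threaded_usec])
```

If the threaded figure is clearly below the sequential one, the threads are overlapping and the per-function numbers in the profiler are simply being summed. If the two are about equal, something else (thread start cost, or shared access to the same objects) is eating the gain.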

I am almost tempted to try calling a Python function separately and performing the threading in that, then returning to concretely prove whether threading in process is in fact working, but I've not yet tried to remotely call any python code. Well, good to learn I guess.

Assuming you're using GDScript, it might be possible Godot would behave differently under C#. I don't know; I have not tried C#.

It might, but I'd rather try to understand how it should behave with GDScript :(

I mean this is just confusing:

I cranked it up to 1k instanced scenes, each with 4 threads running in _process() every frame, but as far as Windows is concerned it's still 31 total and won't cap a single core, despite the framerate now being in the teens purely from CPU. The function I am calling does seem to be working too.

Hmm I've been chopping it up and trying a bunch of things so the code is super ugly now (basically I thought I might be hitting some kind of file lock, so literally split everything up to try and ensure nothing is reading or writing to the same variable).

It looks something like this atm, which does run, it just doesn't seem to see any reduction in process time. The other thing to note is this is (currently) running on an instanced node, of which there are hundreds. The cumulative process time for seperation() is ~10ms, and when split into 4 it drops to sub ~3ms each, but still accumulates to the same total per frame:

var finalsep_1 = Vector3(0.0,0.0,0.0)
var finalsep_2 = Vector3(0.0,0.0,0.0)
var finalsep_3 = Vector3(0.0,0.0,0.0)
var finalsep_4 = Vector3(0.0,0.0,0.0)

var boiddist = 2
var boiddist2 = 2
var boiddist3 = 2
var boiddist4 = 2

onready var thread1 = Thread.new()
onready var thread2 = Thread.new()
onready var thread3 = Thread.new()
onready var thread4 = Thread.new()

func _process(delta):
# ... 
# other parts of code
# ...

	bl_1 = []
	bl_2 = []	
	bl_3 = []	
	bl_4 = []	

	if boid_list.size()>0:
		# Imperfect method for splitting an array into 4 equal sized chunks, but works for the most part.
		# Note: in Godot 3.x, slice() includes the end index, so boundary elements land in two chunks.
		bl_1 = boid_list.slice(0,int(boid_list.size()/4))
		bl_2 = boid_list.slice(int(boid_list.size()/4),int((boid_list.size()/4)*2))
		bl_3 = boid_list.slice(int((boid_list.size()/4)*2),int((boid_list.size()/4)*3))
		bl_4 = boid_list.slice(int((boid_list.size()/4)*3),int(boid_list.size()))

		# I 'assumed' this bit would essentially run all 4 threads in parallel...
		thread1.start(self, "seperation_1", null)
		thread2.start(self, "seperation_2", null)
		thread3.start(self, "seperation_3", null)
		thread4.start(self, "seperation_4", null)		

		# Until this rejoins and retrieves them, effectively meaning the processing time is as slow as the slowest thread + some amount of overhead
		thread1.wait_to_finish()
		thread2.wait_to_finish()
		thread3.wait_to_finish()
		thread4.wait_to_finish()
		
		# Then I can just restack the output
		flockSeperation = ((finalsep_1+finalsep_2+finalsep_3+finalsep_4) / boid_list.size())

# There are now four of these, each numbered. 
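# Thread.start() passes the userdata as the single argument; it is unused here.
func seperation_1(_userdata):
	finalsep_1 = Vector3(0.0,0.0,0.0)

	for x in bl_1:
		if self.global_transform.origin != boidpos:
			# ... expensive code runs here, producing `output` ...
			finalsep_1 += output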
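As an alternative to the four numbered copies, a single worker function can take its batch as the userdata argument and hand its partial sum back through wait_to_finish(), which returns the thread function's return value in Godot 3.x. This is only a sketch under assumptions: the names `_separation_batch` and `compute_separation` are hypothetical, the inverse-square term is a placeholder for the real expensive code, and it expects plain Vector3 positions copied on the main thread so the workers never touch scene-tree nodes.

```gdscript
extends Node

# Worker: userdata is [origin, batch]; returns this batch's partial sum.
func _separation_batch(userdata):
	var origin = userdata[0]
	var batch = userdata[1]
	var partial = Vector3.ZERO
	for pos in batch:
		var offset = origin - pos
		var dist_sq = offset.length_squared()
		if dist_sq > 0.0:
			# Placeholder for the real per-boid calculation.
			partial += offset / dist_sq
	return partial

func compute_separation(origin, positions, thread_count = 4):
	if positions.empty():
		return Vector3.ZERO
	var threads = []
	var chunk = int(ceil(positions.size() / float(thread_count)))
	var start = 0
	while start < positions.size():
		# Build each batch by index so the chunks never overlap.
		var batch = []
		var stop = int(min(start + chunk, positions.size()))
		for i in range(start, stop):
			batch.append(positions[i])
		var worker = Thread.new()
		worker.start(self, "_separation_batch", [origin, batch])
		threads.append(worker)
		start += chunk
	# Join every worker and add up the partial sums.
	var total = Vector3.ZERO
	for started in threads:
		total += started.wait_to_finish()
	return total / float(positions.size())
```

Creating threads every frame on hundreds of nodes has its own cost, so this shape is probably better suited to one manager node that processes all boids than to each instance spawning its own four threads.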

The more I think about it, the less I believe that this is a viable way to gain performance; sub-frame wins would be hard to get.

I would think that threading comes especially in handy where you can divide some task up and then queue it to worker threads that do the work in background over many frames and then sync the results back in after an interval. Not sub-frame but over the course of multiple frames.
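The background-worker idea could look roughly like the sketch below: one long-lived thread sleeps on a Semaphore, picks up a job when the main thread posts one, and publishes its result under a Mutex for the main thread to collect on a later frame. The names `submit`, `take_result` and `_worker_loop` are hypothetical, and the summing loop is a placeholder for the real work.

```gdscript
extends Node

var _thread = Thread.new()
var _semaphore = Semaphore.new()
var _mutex = Mutex.new()
var _should_quit = false
var _job = null     # input snapshot handed to the worker
var _result = null  # latest finished result, collected by the main thread

func _ready():
	_thread.start(self, "_worker_loop", null)

func _worker_loop(_userdata):
	while true:
		_semaphore.wait()  # sleep until the main thread posts
		_mutex.lock()
		var quit = _should_quit
		var job = _job
		_job = null
		_mutex.unlock()
		if quit:
			break
		if job == null:
			continue
		var total = Vector3.ZERO
		for pos in job:
			total += pos  # placeholder for the expensive calculation
		_mutex.lock()
		_result = total
		_mutex.unlock()

# Main thread: hand over a copy of the data and wake the worker.
func submit(positions):
	_mutex.lock()
	_job = positions.duplicate()
	_mutex.unlock()
	_semaphore.post()

# Main thread: returns null until a result is ready.
func take_result():
	_mutex.lock()
	var ready_result = _result
	_result = null
	_mutex.unlock()
	return ready_result

func _exit_tree():
	_mutex.lock()
	_should_quit = true
	_mutex.unlock()
	_semaphore.post()
	_thread.wait_to_finish()
```

In _process() you would call submit() every few frames and poll take_result() each frame, using the previous result until a new one arrives. The thread is created once, so the per-frame cost is just a lock and a semaphore post.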

But my experience with threading is lacking, probably even more so than your own so take it with a pinch o' salt.

Thanks, I think I tried threading across frames, but not more than one frame apart (which I already do to a degree, but without threads). Will maybe give it a crack, but I'm a bit over it for now.

Also what you said about "worker threads that do the work in background" may go a long way to explaining it if this is a design decision for how threads in Godot are meant to function. As in (this is purely guessing and may make no sense) if threads only run in effectively 'spare' time, when there is none they effectively don't run until wait_to_finish() is called, which becomes a blocking function for each one resulting in them effectively running sequentially. Though if that's true it doesn't seem accurate to even call it multi-threading, but idk. Might be provable if I could get GPU usage to exceed CPU as this would create that 'spare time' window that threads may be designed to run in?
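One way to test that guess would be to keep the main thread busy (not idle) for a while after starting a worker and record when the worker actually finished. The sketch below is hypothetical; the 50ms busy loop and the `_mark_done` name are only there for illustration.

```gdscript
extends Node

var finished_at_usec = -1

func _mark_done(_userdata):
	# Record the moment the worker actually ran.
	finished_at_usec = OS.get_ticks_usec()

func _ready():
	var worker = Thread.new()
	var started_usec = OS.get_ticks_usec()
	worker.start(self, "_mark_done", null)

	# Keep the main thread busy for ~50 ms so there is no "spare" main-thread time.
	while OS.get_ticks_usec() - started_usec < 50000:
		pass

	var join_began_usec = OS.get_ticks_usec() - started_usec
	worker.wait_to_finish()
	print("worker finished %d us after start; join began at %d us" % [finished_at_usec - started_usec, join_began_usec])
```

If the worker's finish time is well below the point where the join began, the thread ran on its own while the main thread was busy, which would rule out the "only runs inside wait_to_finish()" theory.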

I guess my main thing was just trying to reach the seemingly elusive 100% usage on either CPU or GPU (or at least close), as in my mind anything less means there's performance breathing room. I assumed ramping up threads would do this but for whatever reason, it doesn't :(

I guess we'll see if Godot 4.0+ Vulkan support, with lower-level control and asynchronous compute, might bring some interesting opportunities.
