Space Clouds

This past week…

As I’m sure everyone noticed, we have had a lot of problems this reset. Some of them were because we weren’t ready on time; others were due to the unavoidable need to track down and fix corruptions over a year old that happened to cause problems in the uni reset patch. We worked intensively during the week to bring everything back to a normal state as quickly as possible, and I can now say fairly safely that the server is at a pretty good performance level (it is actually running faster than it was last uni and should be more reliable thanks to the new memory validation).

A beta client is now available that contains fixes for most of the crash dumps that were sent in (2 of which appeared broken / corrupted, so I couldn’t see the problem, but it might be the same as in other dumps). You can download the beta client from http://www.starsonata.com/dl/StarSonata_2_beta.exe. If you crash on the beta client, please be sure to send in your starsonata.dmp along with a short description of what you were doing to dmpfile (at) starsonata.com

The visibility issue will probably only be fixed on Sunday, but from the quick checking I did, it appears to be client-side only. AIs etc. should see you at your normal visibility; it is just your client that renders everything as ultra high visibility.

The catch-up event begins now: from Saturday 1:30am to Wednesday 1:30am, experience gained, team experience gained, workhours, and colony years will all be 3 times faster than normal.

For those interested in the server fixing process, keep reading…

The major issues encountered came from code written years ago; one specific bug could be traced back 8 years. These were the kind of problems that caused crashes that were extremely hard to track down and properly fix. I spent months slowly working through them, but the hardest ones kept escaping my grasp. I knew they were there, since their signature was still in the crashes that happened every few days or weeks depending on luck, but I could never pinpoint exactly where they came from. There were in total 5 such corruption sources, 4 of which were only possible in a multiprocessor configuration (as in multiple physical processors, not multiple cores on a single processor).

After upgrading our memory fence from an old x86 asm call to an SSE2 approach and modifying our map and set containers to use our own mutexed implementation instead of using the STL directly, many of the real problems started surfacing (mostly as hits on the memory red zones I added to our containers, since I knew the corruption was container-related). Fixing those took far more time than I could ever have expected, but it is now mostly behind us.
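
For the curious, the fence change looks roughly like this. This is only a minimal sketch of the idea, not our actual code: the pre-SSE2 idiom shown in the comment and the helper name full_memory_fence are just for illustration.

    #include <emmintrin.h>  // SSE2 intrinsics

    // A classic pre-SSE2 fence idiom was a locked no-op on the stack,
    // written as inline asm, e.g. with GCC:
    //
    //     asm volatile("lock; addl $0,0(%%esp)" ::: "memory", "cc");
    //
    // SSE2 added a dedicated full-fence instruction, MFENCE, exposed as
    // an intrinsic that works on GCC, Clang and MSVC alike:
    inline void full_memory_fence()
    {
        _mm_mfence();  // no prior load/store may be reordered past this point
    }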

The first version I wrote was built so that it required no code changes for the thousands of variables using those containers to pick up our new threadsafe version. This meant keeping the original interface, which is very poor when it comes to acquiring / releasing locks: the lock had to be taken and released inside every call, and an iterator acted as a scoped lock object, so that once you requested an iterator on such a container, only the thread owning the iterator could work with it until you were done with the iterator.
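
To give an idea of what that interface looks like, here is a stripped-down sketch assuming a std::map wrapper; the names MutexedMap and LockedIterator are made up for illustration.

    #include <map>
    #include <mutex>

    // Sketch: a drop-in std::map replacement that locks on every call.
    template <typename K, typename V>
    class MutexedMap
    {
        std::map<K, V>               data_;
        mutable std::recursive_mutex mtx_;   // recursive, so the thread that
                                             // holds an iterator can still call in
    public:
        // The iterator doubles as a scoped lock: while one is alive, only
        // the thread that owns it can work with the container.
        class LockedIterator
        {
            std::unique_lock<std::recursive_mutex> lock_;
            typename std::map<K, V>::iterator      it_;
        public:
            LockedIterator(std::recursive_mutex& m,
                           typename std::map<K, V>::iterator it)
                : lock_(m), it_(it) {}
            std::pair<const K, V>& operator*()  { return *it_; }
            LockedIterator&        operator++() { ++it_; return *this; }
            bool done(const MutexedMap& owner) const
                { return it_ == owner.data_.end(); }
        };

        // Every plain call acquires and releases the mutex internally.
        void insert(const K& key, const V& value)
        {
            std::lock_guard<std::recursive_mutex> guard(mtx_);
            data_[key] = value;
        }

        LockedIterator begin() { return LockedIterator(mtx_, data_.begin()); }
    };

The recursive mutex is what makes the iterator-as-lock trick work: the thread that owns the iterator can keep calling into the container without deadlocking itself, while every other thread blocks. That per-call locking is also exactly where the overhead comes from.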

With that, most of those crashes went away and were replaced by fairly serious lag (this happened about 30 hours ago, at which point the server became a lot more stable, but also about 15 times slower). This called for a faster alternative for some of our most accessed containers (for instance, the list of all souls in the universe, or the list of items in a ship), so I used a singlelockcontainer object approach. The difference compared to the previous one is that it is not a “plug and play” replacement for the map or set (or any object, in its case); it requires actual code changes at every location accessing the object in question, but in exchange it enforces correct access. Without knowingly cheating the system (taking a reference and copying the pointer, or similar “hacks”), it is impossible for a piece of code to access the contents without first requesting a lock. Once the lock is granted, the object can be read and written by the thread which locked it. This is an order of magnitude faster than the “plug and play” approach, but is generally not worth using for small containers or containers only used for direct access.
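
A stripped-down sketch of that idea, with made-up names (SingleLockContainer, Locked) standing in for our actual types:

    #include <mutex>
    #include <utility>
    #include <vector>

    // Sketch: the wrapped object is only reachable through a handle that
    // holds the lock for its whole lifetime.
    template <typename T>
    class SingleLockContainer
    {
        T          obj_;
        std::mutex mtx_;
    public:
        template <typename... Args>
        explicit SingleLockContainer(Args&&... args)
            : obj_(std::forward<Args>(args)...) {}

        // RAII handle: constructing it takes the mutex, destroying it
        // releases it.
        class Locked
        {
            std::unique_lock<std::mutex> lock_;
            T&                           obj_;
        public:
            Locked(std::mutex& m, T& o) : lock_(m), obj_(o) {}
            T* operator->() { return &obj_; }
            T& operator*()  { return obj_; }
        };

        // The only way in: short of deliberately stashing the reference
        // (the “hack” mentioned above), code cannot reach the contents
        // without taking the lock first.
        Locked lock() { return Locked(mtx_, obj_); }
    };

    // Usage: one lock acquisition covers a whole block of operations,
    // instead of one per call as in the “plug and play” wrapper.
    void example()
    {
        static SingleLockContainer<std::vector<int>> shipItems;
        auto items = shipItems.lock();   // lock taken once here
        items->push_back(42);
        items->push_back(7);
    }                                    // lock released when `items` dies

The compiler enforces the discipline for us: there is simply no other public path to the data, which is what makes the corruption impossible rather than merely unlikely.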

I slowly replaced the slow parts with the new, faster singlelockcontainer until performance reached a point where the overhead from validating and enforcing memory access to all those containers no longer actually affected performance, which is about where we are now.

The next step is to fix all the client issues and a few remaining server bugs, but very soon we should be back to weeks+ of uptime on the server without crashes.

Discuss in the Forum