Discussion:
New libc malloc patch
Jason Evans
2005-11-29 07:05:57 UTC
There is a patch that contains a new libc malloc implementation at:

http://www.canonware.com/~jasone/jemalloc/jemalloc_20051127a.diff

This implementation is very different from the current libc malloc.
Probably the most important difference is that this one is designed
with threads and SMP in mind.

The patch has been tested for stability quite a bit already, thanks
mainly to Kris Kennaway. However, any help with performance testing
would be greatly appreciated. Specifically, I'd like to know how
well this malloc holds up to threaded workloads on SMP systems. If
you have an application that relies on threads, please let me know
how performance is affected.

Naturally, if you notice horrible performance or ridiculous resident
memory usage, that's a bad thing and I'd like to hear about it.

Thanks,
Jason

=== Important notes:

* You need to do a full buildworld/installworld in order for the
patch to work correctly, due to various integration issues with the
threads libraries and rtld.

* The virtual memory size of processes, as reported in the SIZE field
by top, will appear astronomical for almost all processes (32+ MB).
This is expected; it is merely an artifact of using large mmap()ed
regions rather than sbrk().

* In keeping with the default option settings for CURRENT, the A and
J flags are enabled by default. When conducting performance tests,
specify MALLOC_OPTIONS="aj".
Hiten Pandya
2005-11-29 10:52:07 UTC
Jason,

I see that you have included an implementation of red-black tree CPP
macros, but wouldn't it be better if you were to use the ones in
<sys/tree.h>? I have only had a cursory look, but I would have
thought that would be the way to go.

Just a suggestion.

Best regards,

--
Hiten Pandya
hmp at freebsd.org
Jason Evans
2005-11-29 16:41:11 UTC
Post by Hiten Pandya
I see that you have included an implementation of red-black tree CPP
macros, but wouldn't it be better if you were to use the ones in
<sys/tree.h>? I have only had a cursory look, but I would have
thought that would be the way to go.
There is a feature missing from sys/tree.h that I need (rb_nsearch()
in the patch), but you are right that it would probably be best to
use sys/tree.h. I am going to work on adding RB_NFIND(), and will
then try switching to sys/tree.h.

Thanks,
Jason
Jason Evans
2005-12-01 07:20:43 UTC
Post by Hiten Pandya
I see that you have included an implementation of red-black tree CPP
macros, but wouldn't it be better if you were to use the ones in
<sys/tree.h>? I have only had a cursory look, but I would have
thought that would be the way to go.
There's an updated patch available:

http://www.canonware.com/~jasone/jemalloc/jemalloc_20051201a.diff

This patch includes the following changes:

*) Use sys/tree.h rather than a separate red-black tree implementation.

*) Use the __isthreaded symbol to avoid locking for single-threaded
programs, and to simplify malloc initialization. The extra branches
that are required to check __isthreaded should be more than offset by
the removal of an atomic compare/swap operation.

*) Fix an obscure bug (very difficult to trigger without changing
some compile-time constants).
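The __isthreaded fast path described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not jemalloc's actual code: `my_isthreaded`, `malloc_lock_enter`, and the acquisition counter are invented names standing in for libc's `__isthreaded` symbol and the allocator's internal locking.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for libc's __isthreaded: on FreeBSD the threads library
 * flips the real symbol to nonzero once threads exist (assumption
 * for this sketch; here it is just a local flag). */
static int my_isthreaded = 0;
static pthread_mutex_t malloc_mtx = PTHREAD_MUTEX_INITIALIZER;
static int lock_acquisitions = 0;   /* instrumentation for the sketch */

static void
malloc_lock_enter(void)
{
    /* One predictable branch replaces an atomic compare/swap:
     * single-threaded programs never touch the mutex at all. */
    if (my_isthreaded) {
        pthread_mutex_lock(&malloc_mtx);
        lock_acquisitions++;
    }
}

static void
malloc_lock_exit(void)
{
    if (my_isthreaded)
        pthread_mutex_unlock(&malloc_mtx);
}
```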

Jason
Hiten Pandya
2005-12-01 11:38:59 UTC
Thanks Jason!

Kind Regards,

--
Hiten Pandya
hiten.pandya at gmail.com
Maxim Sobolev
2005-11-29 11:37:32 UTC
Just curious: what is the grand plan for this work? I wonder whether it
would make sense to have two mallocs in the system, so that users can
select the one that better suits their needs.

-Maxim
Jason Evans
2005-11-29 16:55:13 UTC
Post by Maxim Sobolev
Just curious: what is the grand plan for this work? I wonder
whether it would make sense to have two mallocs in the system, so
that users can select the one that better suits their needs.
The plan for this work is to replace the current malloc, rather than
augmenting it. There is a long history in Unix of using shared
library tricks to override the system malloc, and the patch does not
change the ability to do so. However, in my opinion, explicitly
providing multiple implementations of malloc in the base OS misses
the point of providing a general purpose memory allocator. The goal
is to have a single implementation that works well for the vast
majority of extant programs, and to allow applications to provide
their own implementations when the general purpose allocator fails to
perform adequately. phkmalloc did an excellent job in this capacity
for quite some time, but now that we need to commonly support
threaded programs on SMP systems, phkmalloc is being strained rather
badly. This isn't an indication that we need multiple malloc
implementations in the base OS; rather it indicates that the system
malloc implementation needs to take into account constraints that did
not exist when phkmalloc was designed.

Jason
Poul-Henning Kamp
2005-11-29 19:14:00 UTC
[...] phkmalloc did an excellent job in this capacity
for quite some time, but now that we need to commonly support
threaded programs on SMP systems, phkmalloc is being strained rather
badly. This isn't an indication that we need multiple malloc
implementations in the base OS; rather it indicates that the system
malloc implementation needs to take into account constraints that did
not exist when phkmalloc was designed.
The malloc that phkmalloc replaced was written at some point in the
1980s on a VAX, and more or less assumed that the VAX was effectively
a single-user machine without effective paging algorithms.

Phkmalloc was written in 1994/5 where I had 4MB of RAM in my
"Gateway Handbook 486" and very strongly assumed that with the
RAM prices of the day, I could not afford an upgrade.

I gave a talk about phkmalloc at USENIX ATC 1998 in New Orleans.
One of the central points in the talk was that infrastructure code
should have regular service overhauls, to check that the assumptions
in the design are still valid.

In addition to assumptions phkmalloc makes which are no longer
relevant, there are many assumptions which should be made today
which phkmalloc is not aware of, multi-threading being but one of
them. Cache line effects, pipeline prefetching, multi-cpu systems,
different VM system algorithms, larger address spaces etc etc etc.

Once Jason is done, I have no doubt that "jemalloc" will beat
phkmalloc in all relevant benchmarks, thereby neatly rendering any
discussion about having multiple mallocs in the tree pointless.

A big thank you from the author of phkmalloc to Jason for following
the service manual to the letter :-)

Poul-Henning
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Jon Dama
2005-11-29 20:06:39 UTC
I have a rather strong objection to make to this proposal (read: if this
change goes in I'm going to have to go through the effort of ripping it
out locally...):

There exists a problem right now--localized to i386 and any other arch
based on 32-bit pointers: address space is simply too scarce.

Your decision to switch to using mmap as the exclusive source of malloc
buckets is admirable for its modernity but it simply cannot stand unless
someone steps up to change the way mmap and brk interact within the
kernel.

The trouble arises from the need to set MAXDSIZ and the resulting effect
it has in determining the start of the mmap region--which I might add is
the location that the shared library loader is placed. This effectively
(and explicitly) sets the limit for how large of a contiguous region can
be allocated with brk.

What you've done by switching the system malloc to exclusively using
mmap is induce a lot of motivation on the part of the sysadmin to push
that brk/mmap boundary down.

This wouldn't be a problem except that you've effectively shot in the foot
dozens of alternative C malloc implementations, not to mention the memory
allocator routines used in obscure languages such as Modula-3 and Haskell
that rely on brk-derived buckets.

This isn't playing very nicely!
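For readers unfamiliar with the two page sources being argued about here, the following is a minimal background illustration of obtaining memory from the brk/sbrk data segment versus from anonymous mmap. The helper names are invented; this is not code from the patch.

```c
#define _DEFAULT_SOURCE          /* for sbrk() and MAP_ANON on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Source 1: extend the data segment. Contiguous, bounded by MAXDSIZ,
 * and shared with every other brk consumer in the process. */
static void *
grab_from_sbrk(size_t len)
{
    void *p = sbrk((intptr_t)len);
    return (p == (void *)-1) ? NULL : p;
}

/* Source 2: an anonymous private mapping, placed by the kernel in the
 * mmap region above the data segment. */
static void *
grab_from_mmap(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```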

I looked into the issues and limitations with phkmalloc several months ago
and concluded that simply adopting ptmalloc2 (the linux malloc) was the
better approach--notably it is willing to draw from both brk and mmap, and
it also implements per-thread arenas.

There is also cause for concern about your "cache-line" business. Simply
on the face of it there is the problem that the scheduler does not do a
good job of pinning threads to individual CPUs. The threads are already
bouncing from CPU to CPU and thrashing (really thrashing) each CPU cache
along the way.

Second, you've forgotten that there is a layer of indirection between your
address space and the cache: the mapping of logical pages (what you can
see in userspace) to physical pages (the addresses of which actually
matter for the purposes of the cache). I don't recall off-hand whether or
not the L1 cache on i386 is based on tags of the virtual addresses, but I
am certain that the L2 and L3 caches tag the physical addresses not the
virtual addresses.

This means that your careful address selection based on cache-lines will
only work out if it is done in the vm codepath: remember the mapping of
physical addresses to the virtual addresses that come back from mmap can
be delayed arbitrarily long into the future depending on when the program
actually goes to touch that memory.

Furthermore, the answer may vary depending on the architecture or even the
processor version.

-Jon
Jason Evans
2005-11-29 21:07:06 UTC
Jon,

Thanks for your comments. Fortunately, I don't think things are
quite as hopeless as you think, though you may be right that some
adjustments are necessary. Specific replies follow.
Post by Jon Dama
There exists a problem right now--localized to i386 and any other arch
based on 32-bit pointers: address space is simply too scarce.
Your decision to switch to using mmap as the exclusive source of malloc
buckets is admirable for its modernity but it simply cannot stand unless
someone steps up to change the way mmap and brk interact within the
kernel.
The trouble arises from the need to set MAXDSIZ and the resulting effect
it has in determining the start of the mmap region--which I might add is
the location that the shared library loader is placed. This effectively
(and explicitly) sets the limit for how large of a contiguous region can
be allocated with brk.
What you've done by switching the system malloc to exclusively using
mmap is induce a lot of motivation on the part of the sysadmin to push
that brk/mmap boundary down.
This wouldn't be a problem except that you've effectively shot in the foot
dozens of alternative c malloc implementations, not to mention the memory
allocator routines used in obscure languages such as Modula-3 and Haskell
that rely on brk derived buckets.
This isn't playing very nicely!
Where should MAXDSIZ be? Given scarce address space, the best we can
hope for is setting it to the "least bad" default, as measured by
what the programs we care about actually do. No matter what we do,
some programs lose.

That said, it turns out that adding the ability to allocate via brk
isn't hard. The code already contains the logic to recycle address
ranges, in order to reduce mmap system call overhead. All of the
locking infrastructure is also already in place. The only necessary
modifications are 1) explicit use of brk until all data segment space
is consumed, 2) special case code for addresses in the brk range, so
that madvise() is used instead of munmap(), and 3) preferential reuse
of the brk address space over mmap'ed memory.
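Point 2 above might look roughly like this sketch (illustrative names only; jemalloc's actual code differs): memory inside the brk range cannot be munmap'ed without punching a hole in the data segment, so those pages are returned to the system with madvise() while the mapping itself is kept.

```c
#define _DEFAULT_SOURCE          /* for sbrk() and madvise() on glibc */
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed bounds of the sbrk-managed region (hypothetical globals). */
static void *brk_lo, *brk_hi;

/* Return page-aligned pages to the kernel. Inside the brk range we
 * must keep the mapping and only drop the backing pages; outside it,
 * the chunk came from mmap and can simply be unmapped. */
static int
pages_release(void *addr, size_t len)
{
    if ((char *)addr >= (char *)brk_lo && (char *)addr < (char *)brk_hi)
        return madvise(addr, len, MADV_DONTNEED);
    return munmap(addr, len);
}
```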

Do you agree that there is no need for using brk on 64-bit systems?
Post by Jon Dama
I looked into the issues and limitations with phkmalloc several months ago
and concluded that simply adopting ptmalloc2 (the linux malloc) was the
better approach--notably it is willing to draw from both brk and mmap, and
it also implements per-thread arenas.
ptmalloc takes a different approach to per-thread arenas that has
been shown in multiple papers to not scale as well as the approach I
took. The difference isn't significant until you get to 8+ CPUs, but
we already have systems running with enough CPUs that this is an issue.
Post by Jon Dama
There is also cause for concern about your "cache-line" business. Simply
on the face of it there is the problem that the scheduler does not do a
good job of pinning threads to individual CPUs. The threads are already
bouncing from CPU to CPU and thrashing (really thrashing) each CPU cache
along the way.
Second, you've forgotten that there is a layer of indirection between your
address space and the cache: the mapping of logical pages (what you can
see in userspace) to physical pages (the addresses of which actually
matter for the purposes of the cache). I don't recall off-hand whether or
not the L1 cache on i386 is based on tags of the virtual addresses, but I
am certain that the L2 and L3 caches tag the physical addresses not the
virtual addresses.
This means that your careful address selection based on cache-lines will
only work out if it is done in the vm codepath: remember, the mapping of
physical addresses to the virtual addresses that come back from mmap can
be delayed arbitrarily long into the future depending on when the program
actually goes to touch that memory.
Furthermore, the answer may vary depending on the architecture or even the
processor version.
I don't think you understand the intent for the "cache-line
business". There is only one intention: avoid storing data
structures in the same cache line if they are likely to be accessed
simultaneously by multiple threads. For example, if two independent
structures are stored right next to each other, although they do not
require synchronization protection from each other, the hardware will
send a cache line invalidation message to other CPUs every time
anything in the cache line is modified. This means horrible cache
performance if the data are modified often.

As a particular example, there are per-arena data structures that are
modified quite often (arena_t). If two arenas were right next to
each other, then they could share a cache line, and performance would
potentially be severely impacted.

|---arena_t---|---arena_t---|
| | | | | | | |
^^^
BAD!
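One common way to guarantee that two adjacent arena_t structures never share a line is to pad each one out to a multiple of the cache-line size. A minimal sketch, assuming a 64-byte line; the type and field names are invented, not jemalloc's:

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64   /* assumed line size; real values vary by CPU */

/* Frequently-written per-arena counters (invented fields). */
typedef struct {
    unsigned long nmalloc;
    unsigned long ndalloc;
} arena_stats_t;

/* Pad to a multiple of CACHELINE so two adjacent arenas can never
 * occupy the same cache line, avoiding the false sharing described
 * in the diagram above. */
typedef union {
    arena_stats_t stats;
    char pad[((sizeof(arena_stats_t) + CACHELINE - 1) / CACHELINE) * CACHELINE];
} arena_padded_t;
```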

Thanks,
Jason
Jon Dama
2005-11-29 22:21:07 UTC
Jason,

Actually, I didn't mean to imply it was hopeless at all. :-) Obviously
there are solutions to the address space issues, and I agree it is
possible to change your code to improve the situation quite a bit. The
best fix might be to permit mmap to consume space below MAXDSIZ, either
when hinted to do so or when the space above is exhausted; but before I
recommend that too strongly, I should say that I haven't looked into
what it would entail at all.

Let me take a closer look at what you are doing with regards to
cache-lines. You seem to be implying that you are only taking care in
regards to how you malloc within a given page?

I have a suspicion that it might just be better to dump the problem onto
the application, in the sense that no malloc should ever return less
than the size of one cache line. Perhaps this is what you are doing?

-Jon
Jason Evans
2005-11-29 22:27:26 UTC
Post by Jon Dama
Let me take a closer look at what you are doing with regards to
cache-lines. You seem to be implying that you are only taking care in
regards to how you malloc within a given page?
You are correct that I am only taking care about allocations within a
given page.
Post by Jon Dama
I have a suspicion that it might just be better to dump the problem on to
the application in the sense that no malloc should ever be less
than the size of one cache line. Perhaps this is what you are doing?
I am only worrying about cache line alignment for malloc's internal
data structures. It's up to the application to do this for its
allocations, if necessary (doing so for all allocations would induce
unacceptable internal fragmentation). This implementation provides
posix_memalign(3), which makes it much less painful for the
application to do so.
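For reference, an application can use posix_memalign(3) along these lines to get cache-line-aligned blocks for its own hot structures. `alloc_aligned` is a hypothetical convenience wrapper, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper around posix_memalign(3): returns a block of
 * `size` bytes aligned to `alignment` (a power of two that is a
 * multiple of sizeof(void *)), or NULL on failure. */
static void *
alloc_aligned(size_t alignment, size_t size)
{
    void *p = NULL;
    return (posix_memalign(&p, alignment, size) == 0) ? p : NULL;
}
```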

Jason
David Xu
2005-11-30 00:03:20 UTC
Post by Jon Dama
I looked into the issues and limitations with phkmalloc several months ago
and concluded that simply adopting ptmalloc2 (the linux malloc) was the
better approach--notably it is willing to draw from both brk and mmap, and
it also implements per-thread arenas.
Hi Jon,

Is there any chance to benchmark jemalloc against ptmalloc2? I would
like to see us using the best-performing memory allocator for the
next ten years. :-)

David Xu
Jason Evans
2005-12-03 07:49:37 UTC
Post by Jon Dama
There exists a problem right now--localized to i386 and any other arch
based on 32-bit pointers: address space is simply too scarce.
Your decision to switch to using mmap as the exclusive source of malloc
buckets is admirable for its modernity but it simply cannot stand unless
someone steps up to change the way mmap and brk interact within the
kernel.
There's a new version of the patch available at:

http://www.canonware.com/~jasone/jemalloc/jemalloc_20051202b.diff

This version of the patch adds the following:

* Prefer to use sbrk() rather than mmap() on 32-bit platforms.

* Lazily create arenas, so that single-threaded applications don't
dedicate space to arenas they never use.

* Add the '*' and '/' MALLOC_OPTIONS flags, which allow control over
the number of arenas.
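Lazy arena creation might be sketched with thread-specific data along these lines. This is illustrative only: `arena_get` and friends are invented names, and the real patch does not necessarily use pthread keys for this.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy per-thread arena (invented; the real arena_t is much larger). */
typedef struct { int id; } arena_t;

static pthread_key_t arena_key;
static pthread_once_t arena_once = PTHREAD_ONCE_INIT;
static int narenas_created = 0;

static void
arena_key_init(void)
{
    /* Destructor frees the arena when its thread exits. */
    pthread_key_create(&arena_key, free);
}

/* An arena is only allocated the first time a thread actually calls
 * into the allocator, so a single-threaded program pays for exactly
 * one arena instead of one per potential thread. */
static arena_t *
arena_get(void)
{
    pthread_once(&arena_once, arena_key_init);
    arena_t *a = pthread_getspecific(arena_key);
    if (a == NULL) {                    /* first call from this thread */
        a = malloc(sizeof(*a));
        a->id = __sync_fetch_and_add(&narenas_created, 1);
        pthread_setspecific(arena_key, a);
    }
    return a;
}
```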

As of this patch, all of the issues that were brought to my attention
have been addressed. This is a good time for additional review and
serious benchmarking.

Thanks,
Jason
David Xu
2005-12-03 08:26:02 UTC
I have a question about the mutex used in the patch: you are using a
spin loop; isn't that suboptimal? Also, with a thread library like
libpthread that supports static priority scheduling, this mutex does
not work; it can cause a deadlock. If a lower-priority thread locks
the mutex and is preempted by a higher-priority thread, and the
higher-priority thread also calls malloc, it will spin there waiting
for the lower-priority thread to finish, but that will never happen.

David Xu
Jason Evans
2005-12-03 21:08:51 UTC
Post by David Xu
I have a question about the mutex used in the patch: you are using a
spin loop; isn't that suboptimal? Also, with a thread library like
libpthread that supports static priority scheduling, this mutex does
not work; it can cause a deadlock. If a lower-priority thread locks
the mutex and is preempted by a higher-priority thread, and the
higher-priority thread also calls malloc, it will spin there waiting
for the lower-priority thread to finish, but that will never happen.
David,

You are correct that this is a problem. Thank you for pointing it
out. There's a new patch that uses the spinlocks that are provided
by the threads libraries. Please let me know if this looks okay.

Also, this patch removes/modifies the code that was causing build
failures on amd64, so it's worth giving another try. Hopefully, it
will compile now...

http://www.canonware.com/~jasone/jemalloc/jemalloc_20051203a.diff

Thanks,
Jason
David Xu
2005-12-04 01:40:39 UTC
The libc spinlocks are deprecated. In fact, the thread libraries try to
keep track of all spinlocks in libc and reset them in the child process
after fork; they complain if there are too many spinlocks. This is not
very correct, but it resolves deadlocks in real-world (if weird)
applications.
Since I see you have put _malloc_prefork() and _malloc_postfork() hooks
in the thread libraries, I guess you want to manage all malloc locks
yourself, so you might not need to use the spinlocks. You can implement
these locks with the umtx primitives provided by the kernel, using
UMTX_OP_WAIT and UMTX_OP_WAKE. The UMTX_OP_LOCK and UMTX_OP_UNLOCK
operations could also be used, but I am reserving those two, since I
plan to implement reliable POSIX process-shared mutexes. You can study
the code in libthr to see how to use umtx.
Lastly, I don't know whether umtx will work with libc_r, but libc_r has
already been disconnected from the build for some days; it will rot
away.

Regards,
David Xu
David Xu
2005-12-04 02:53:01 UTC
Here is sample code implementing a mutex using the umtx syscalls:

#include <err.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ucontext.h>
#include <sys/umtx.h>
#include <machine/atomic.h>
#include <pthread.h>

#define LCK_UNLOCKED  0
#define LCK_LOCKED    1
#define LCK_CONTENDED 2

void
lock_mtx(struct umtx *mtx)
{
    volatile uintptr_t *m = (volatile uintptr_t *)mtx;

    for (;;) {
        /* Try to lock it. */
        if (atomic_cmpset_acq_ptr(m, LCK_UNLOCKED, LCK_LOCKED))
            return;
        if (atomic_load_acq_ptr(m) == LCK_LOCKED) {
            /*
             * If it was locked by a single thread, try to
             * move it to the contended state.
             */
            if (!atomic_cmpset_acq_ptr(m, LCK_LOCKED, LCK_CONTENDED))
                continue;
        }
        /* If in the contended state, wait for it to be unlocked. */
        if (atomic_load_acq_ptr(m) == LCK_CONTENDED)
            _umtx_op((struct umtx *)m, UMTX_OP_WAIT, LCK_CONTENDED,
                0, NULL);
    }
}

void
unlock_mtx(struct umtx *mtx)
{
    volatile uintptr_t *m = (volatile uintptr_t *)mtx;

    for (;;) {
        if (atomic_load_acq_ptr(m) == LCK_UNLOCKED)
            errx(1, "unlock of an unlocked mutex");
        if (atomic_load_acq_ptr(m) == LCK_LOCKED) {
            if (atomic_cmpset_acq_ptr(m, LCK_LOCKED, LCK_UNLOCKED))
                return;
        }
        if (atomic_load_acq_ptr(m) == LCK_CONTENDED) {
            /* Wake one waiter after releasing the lock. */
            atomic_store_rel_ptr(m, LCK_UNLOCKED);
            _umtx_op((struct umtx *)m, UMTX_OP_WAKE, 1, NULL, NULL);
            break;
        }
    }
}

struct umtx m;

void *
lock_test(void *arg)
{
    int i;

    for (i = 0; i < 10000; ++i) {
        lock_mtx(&m);
        pthread_yield();
        unlock_mtx(&m);
    }
    return (0);
}

int
main(void)
{
    pthread_t td1, td2;

    pthread_create(&td1, NULL, lock_test, NULL);
    pthread_create(&td2, NULL, lock_test, NULL);
    pthread_join(td1, NULL);
    pthread_join(td2, NULL);
    return (0);
}
David Xu
2005-12-04 03:34:18 UTC
Permalink
Post by David Xu
...
void
unlock_mtx(struct umtx *mtx)
{
	volatile uintptr_t *m = (volatile uintptr_t *)mtx;

	for (;;) {
		if (atomic_load_acq_ptr(m) == LCK_UNLOCKED)
			errx(1, "unlock of an unlocked mutex");
		if (atomic_load_acq_ptr(m) == LCK_LOCKED) {
			if (atomic_cmpset_acq_ptr(m, LCK_LOCKED,
			    LCK_UNLOCKED))
				return;
		}
		if (atomic_load_acq_ptr(m) == LCK_CONTENDED) {
			atomic_store_rel_ptr(m, LCK_UNLOCKED);
			_umtx_op((struct umtx *)m, UMTX_OP_WAKE, 1, NULL, NULL);
Oops, that should be:
			_umtx_op((struct umtx *)m, UMTX_OP_WAKE, INT_MAX,
			    NULL, NULL);

Waking every waiter like this is not very optimal if there are lots of
threads waiting, though. :-)

There is a more optimal version using a transaction id:
http://www.dragonflybsd.org/cvsweb/src/lib/libthread_xu/thread/thr_umtx.c?rev=1.2&content-type=text/x-cvsweb-markup
Though libthr in FreeBSD does not use these semantics; instead they
are implemented in the kernel.

David Xu
Jason Evans
2005-12-04 03:46:01 UTC
Permalink
Post by David Xu
The libc spinlocks are deprecated. In fact, the thread libraries
try to keep track of all spinlocks in libc and reset them in the
child process after fork; they will complain if there are too many
spinlocks. This is not strictly correct, but it resolves deadlocks
in real-world (weird) applications.
Since I see you have put _malloc_prefork() and _malloc_postfork()
hooks in the thread libraries, I guess you want to manage all the
malloc locks yourself, so you might not need the spinlocks. You can
implement these locks using the umtx primitives provided by the
kernel: UMTX_OP_WAIT and UMTX_OP_WAKE. UMTX_OP_LOCK and
UMTX_OP_UNLOCK could also be used to implement locks, but I am
reserving those two operations because I plan to implement reliable
POSIX process-shared mutexes. You can study the code in libthr to
see how umtx is used.
Last, I don't know whether umtx will work with libc_r, but libc_r
has already been disconnected from the build for some time; it will
rot away.
I just need simple (low overhead) mutexes that don't cause malloc to
be called during their initialization. I would have used
pthread_mutex_* directly, but cannot due to infinite recursion
problems during initialization.

As you pointed out, it's important to get priority inheritance right
in order to avoid priority inversion deadlock, so my hand-rolled
spinlocks weren't adequate. I need mutexes that are managed by the
threads library. The libc spinlocks appear to fit the bill perfectly
in that capacity. It seems to me that using umtx would actually be
the wrong thing to do, because I'd be circumventing libpthread's
userland scheduler, and it would be the wrong thing for libc_r, as
you pointed out. This approach would work for libthr, but perhaps
nothing else?

I'd like to keep things as simple and general as possible. Is the
current implementation that uses libc spinlocks acceptable?

Thanks,
Jason

P.S. Why are libc spinlocks deprecated?
David Xu
2005-12-04 09:54:53 UTC
Permalink
I just need simple (low overhead) mutexes that don't cause malloc to be
called during their initialization.
umtx is lightweight and fast, and it does not need malloc.
I would have used pthread_mutex_*
directly, but cannot due to infinite recursion problems during
initialization.
Yes, I know the current pthread_mutex implementations use malloc;
I don't think that will change any time soon.
As you pointed out, it's important to get priority inheritance right in
order to avoid priority inversion deadlock, so my hand-rolled spinlocks
weren't adequate. I need mutexes that are managed by the threads
library. The libc spinlocks appear to fit the bill perfectly in that
capacity. It seems to me that using umtx would actually be the wrong
thing to do, because I'd be circumventing libpthread's userland
scheduler, and it would be the wrong thing for libc_r, as you pointed
out. This approach would work for libthr, but perhaps nothing else?
umtx will work with libpthread; I cannot find any reason why using
umtx would cause a deadlock. The userland scheduler cannot propagate
its priority decisions across the kernel boundary anyway, and umtx is
a blockable syscall.
I'd like to keep things as simple and general as possible. Is the
current implementation that uses libc spinlocks acceptable?
Thanks,
Jason
P.S. Why are libc spinlocks deprecated?
Because we want other libraries to use pthread mutexes; if they
cannot be used widely and we have to fall back to spinlocks, that is
bad taste. I think only malloc has the recursion problem. In fact,
libpthread needs malloc to initialize a spinlock, so you cannot
create spinlocks dynamically in your malloc code; only libthr does
not have that problem. libc_r also has the priority inversion problem
with your current mutex code.

Regards,
David Xu
Claus Guttesen
2005-12-03 20:45:53 UTC
Permalink
Post by Jason Evans
http://www.canonware.com/~jasone/jemalloc/jemalloc_20051202b.diff
When I do a make buildworld I get:

cc -fpic -DPIC -O2 -fno-strict-aliasing -pipe -march=athlon64
-I/usr/src/lib/libc/include -I/usr/src/lib/libc/../../include
-I/usr/src/lib/libc/amd64 -D__DBINTERFACE_PRIVATE
-I/usr/src/lib/libc/../../contrib/gdtoa -DINET6
-I/usr/obj/usr/src/lib/libc -DPOSIX_MISTAKE -I/usr/src/lib/libc/locale
-DBROKEN_DES -DPORTMAP -DDES_BUILTIN -I/usr/src/lib/libc/rpc -DYP
-Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -c
/usr/src/lib/libc/stdlib/malloc.c -o malloc.So
cc -O2 -fno-strict-aliasing -pipe -march=athlon64
-I/usr/src/lib/libc/include -I/usr/src/lib/libc/../../include
-I/usr/src/lib/libc/amd64 -D__DBINTERFACE_PRIVATE
-I/usr/src/lib/libc/../../contrib/gdtoa -DINET6
-I/usr/obj/usr/src/lib/libc -DPOSIX_MISTAKE -I/usr/src/lib/libc/locale
-DBROKEN_DES -DPORTMAP -DDES_BUILTIN -I/usr/src/lib/libc/rpc -DYP
-Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -c
/usr/src/lib/libc/stdlib/merge.c
/usr/src/lib/libc/stdlib/malloc.c: In function `malloc_mutex_lock':
/usr/src/lib/libc/stdlib/malloc.c:846: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:846: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:853: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c: In function `malloc_mutex_unlock':
/usr/src/lib/libc/stdlib/malloc.c:894: warning: cast from pointer to
integer of different size
*** Error code 1
/usr/src/lib/libc/stdlib/malloc.c: In function `malloc_mutex_lock':
/usr/src/lib/libc/stdlib/malloc.c:846: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:846: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:853: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c: In function `malloc_mutex_unlock':
/usr/src/lib/libc/stdlib/malloc.c:894: warning: cast from pointer to
integer of different size
*** Error code 1
2 errors
*** Error code 2
1 error
*** Error code 2
1 error
*** Error code 2
1 error
*** Error code 2
1 error
*** Error code 2
1 error
make -j 3 buildworld 763,54s user 150,98s system 173% cpu 8:48,62 total

twin/usr/src#>uname -a
FreeBSD twin.gnome.no 7.0-CURRENT FreeBSD 7.0-CURRENT #0: Thu Dec 1
21:38:11 CET 2005 ***@twin.gnome.no:/usr/obj/usr/src/sys/TWIN
amd64

regards
Claus
Jason Evans
2005-12-03 21:04:38 UTC
Permalink
Post by Claus Guttesen
Post by Jason Evans
http://www.canonware.com/~jasone/jemalloc/jemalloc_20051202b.diff
cc -fpic -DPIC -O2 -fno-strict-aliasing -pipe -march=athlon64
-I/usr/src/lib/libc/include -I/usr/src/lib/libc/../../include
-I/usr/src/lib/libc/amd64 -D__DBINTERFACE_PRIVATE
-I/usr/src/lib/libc/../../contrib/gdtoa -DINET6
-I/usr/obj/usr/src/lib/libc -DPOSIX_MISTAKE -I/usr/src/lib/libc/locale
-DBROKEN_DES -DPORTMAP -DDES_BUILTIN -I/usr/src/lib/libc/rpc -DYP
-Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -c
/usr/src/lib/libc/stdlib/malloc.c -o malloc.So
cc -O2 -fno-strict-aliasing -pipe -march=athlon64
-I/usr/src/lib/libc/include -I/usr/src/lib/libc/../../include
-I/usr/src/lib/libc/amd64 -D__DBINTERFACE_PRIVATE
-I/usr/src/lib/libc/../../contrib/gdtoa -DINET6
-I/usr/obj/usr/src/lib/libc -DPOSIX_MISTAKE -I/usr/src/lib/libc/locale
-DBROKEN_DES -DPORTMAP -DDES_BUILTIN -I/usr/src/lib/libc/rpc -DYP
-Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -c
/usr/src/lib/libc/stdlib/merge.c
/usr/src/lib/libc/stdlib/malloc.c:846: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:846: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:853: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:894: warning: cast from pointer to
integer of different size
*** Error code 1
/usr/src/lib/libc/stdlib/malloc.c:846: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:846: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:853: warning: cast from pointer to
integer of different size
/usr/src/lib/libc/stdlib/malloc.c:894: warning: cast from pointer to
integer of different size
*** Error code 1
2 errors
*** Error code 2
1 error
*** Error code 2
1 error
*** Error code 2
1 error
*** Error code 2
1 error
*** Error code 2
1 error
make -j 3 buildworld 763,54s user 150,98s system 173% cpu 8:48,62 total
twin/usr/src#>uname -a
FreeBSD twin.gnome.no 7.0-CURRENT FreeBSD 7.0-CURRENT #0: Thu Dec 1
amd64
Did you use the 20051202b patch? I thought I had fixed the problem,
but I don't have an amd64 system to test on. In any case, I'll be
uploading a new patch in a few minutes that removes the offending
code entirely.

Thanks,
Jason
Claus Guttesen
2005-12-03 21:15:58 UTC
Permalink
Post by Jason Evans
Did you use the 20051202b patch? I thought I had fixed the problem,
but I don't have an amd64 system to test on. In any case, I'll be
uploading a new patch in a few minutes that removes the offending
code entirely.
Yes. I'll do the test that you want me to do :-) Thank you!

regards
Claus
Claus Guttesen
2005-12-04 12:51:39 UTC
Permalink
Post by Jason Evans
Did you use the 20051202b patch? I thought I had fixed the problem,
but I don't have an amd64 system to test on. In any case, I'll be
uploading up a new patch in a few minutes that removes the offending
code entirely.
I was able to do a buildworld on current with this patch, but I had
problems getting X to run, and kldxref used up all the space on my
root partition during installkernel. So I downgraded to 6.0-STABLE
and got this error:

===> libexec/atrun (all)
cc -O2 -fno-strict-aliasing -pipe -march=athlon64
-DATJOB_DIR=\"/var/at/jobs/\" -DLFILE=\"/var/at/jobs/.lockfile\"
-DLOADAVG_MX=1.5 -DATSPOOL_DIR=\"/var/at/spool\" -DVERSION=\"2.9\"
-DDAEMON_UID=1 -DDAEMON_GID=1 -DDEFAULT_BATCH_QUEUE=\'E\'
-DDEFAULT_AT_QUEUE=\'c\' -DPERM_PATH=\"/var/at/\"
-I/usr/src/libexec/atrun/../../usr.bin/at -I/usr/src/libexec/atrun -c
/usr/src/libexec/atrun/atrun.c
cc -O2 -fno-strict-aliasing -pipe -march=athlon64
-DATJOB_DIR=\"/var/at/jobs/\" -DLFILE=\"/var/at/jobs/.lockfile\"
-DLOADAVG_MX=1.5 -DATSPOOL_DIR=\"/var/at/spool\" -DVERSION=\"2.9\"
-DDAEMON_UID=1 -DDAEMON_GID=1 -DDEFAULT_BATCH_QUEUE=\'E\'
-DDEFAULT_AT_QUEUE=\'c\' -DPERM_PATH=\"/var/at/\"
-I/usr/src/libexec/atrun/../../usr.bin/at -I/usr/src/libexec/atrun -c
/usr/src/libexec/atrun/gloadavg.c
cc -O2 -fno-strict-aliasing -pipe -march=athlon64
-DATJOB_DIR=\"/var/at/jobs/\" -DLFILE=\"/var/at/jobs/.lockfile\"
-DLOADAVG_MX=1.5 -DATSPOOL_DIR=\"/var/at/spool\" -DVERSION=\"2.9\"
-DDAEMON_UID=1 -DDAEMON_GID=1 -DDEFAULT_BATCH_QUEUE=\'E\'
-DDEFAULT_AT_QUEUE=\'c\' -DPERM_PATH=\"/var/at/\"
-I/usr/src/libexec/atrun/../../usr.bin/at -I/usr/src/libexec/atrun
-o atrun atrun.o gloadavg.o
/usr/obj/usr/src/tmp/usr/lib/libc.so: undefined reference to `calloc'
/usr/obj/usr/src/tmp/usr/lib/libc.so: undefined reference to `posix_memalign'
*** Error code 1

Stop in /usr/src/libexec/atrun.
*** Error code 1

Stop in /usr/src/libexec.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.
make buildworld 1122,93s user 217,28s system 84% cpu 26:18,72 total

twin/usr/src#>uname -a
FreeBSD twin.gnome.no 6.0-STABLE FreeBSD 6.0-STABLE #0: Sun Dec 4
01:18:58 CET 2005 ***@twin.gnome.no:/usr/obj/usr/src/sys/TWIN
amd64


regards
Claus
Jason Evans
2005-12-12 00:04:09 UTC
Permalink
Post by Claus Guttesen
I was able to do a buildworld on current with this patch, but I had
problems getting X to run, and kldxref used up all the space on my
root partition during installkernel.
I've fixed the offending bug in kldxref in the latest patch:

http://www.canonware.com/~jasone/jemalloc/jemalloc_20051211b.diff

I spent several hours poking at X, but was never able to determine
why it goes into an infinite loop. The infinite loop happens rather
early, during the load of the libbitmap module. My best guess is
that it is stuck trying to acquire the Xlib lock, but cannot be sure,
since I don't know how to get debug symbols for the loaded X module.
In any case, malloc is nowhere in the backtrace. I do not have the
time to acquire the X expertise that is likely needed to track down
this problem. Hopefully someone else will be willing to do so.

No new problems in the malloc code have been found for some time
now. It has been tested on i386, sparc64, arm, and amd64. In my
opinion, the malloc patch is ready to be committed. I am now working
on the assumption that new problems are more likely application bugs
than malloc bugs. This seems like a good time to start sharing the
debugging load with the community. =)

So, how about it? Can this patch go in now?

Thanks,
Jason
Julian Elischer
2005-12-12 00:35:38 UTC
Permalink
Post by Jason Evans
Post by Claus Guttesen
I was able to do a buildworld on current with this patch, but I had
problems getting X to run and kldxref took all my space on the
root-partition doing a installkernel.
http://www.canonware.com/~jasone/jemalloc/jemalloc_20051211b.diff
I spent several hours poking at X, but was never able to determine
why it goes into an infinite loop. The infinite loop happens rather
early, during the load of the libbitmap module. My best guess is
that it is stuck trying to acquire the Xlib lock, but cannot be sure,
since I don't know how to get debug symbols for the loaded X module.
In any case, malloc is nowhere in the backtrace. I do not have the
time to acquire the X expertise that is likely needed to track down
this problem. Hopefully someone else will be willing to do so.
No new problems in the malloc code have been found for some time
now. It has been tested on i386, sparc64, arm, and amd64. In my
opinion, the malloc patch is ready to be committed. I am now working
on the assumption that new problems are more likely application bugs
than malloc bugs. This seems like a good time to start sharing the
debugging load with the community. =)
So, how about it? Can this patch go in now?
I may have missed it, but some benchmark numbers would be good.

Is there no way to make it an option for a while?
That would get good testing AND a fallback for people.
Post by Jason Evans
Thanks,
Jason
David Xu
2005-12-12 00:50:01 UTC
Permalink
Post by Julian Elischer
Post by Jason Evans
No new problems in the malloc code have been found for some time
now. It has been tested on i386, sparc64, arm, and amd64. In my
opinion, the malloc patch is ready to be committed. I am now
working on the assumption that new problems are more likely
application bugs than malloc bugs. This seems like a good time to
start sharing the debugging load with the community. =)
So, how about it? Can this patch go in now?
I may have missed it but some benchmark numbers could be good.
Is there no way to make it an option for a while?
that would get good testing AND a fallback for people.
I would also like to see some benchmark numbers. In fact, I had
planned to import ptmalloc in the past; the malloc problem has been
discussed several times on the thread@ list.
Also, it would be nice if a fallback could be provided. :-)

David Xu
Kris Kennaway
2005-12-12 01:29:07 UTC
Permalink
Post by David Xu
Post by Julian Elischer
Post by Jason Evans
No new problems in the malloc code have been found for some time
now. It has been tested on i386, sparc64, arm, and amd64. In my
opinion, the malloc patch is ready to be committed. I am now
working on the assumption that new problems are more likely
application bugs than malloc bugs. This seems like a good time to
start sharing the debugging load with the community. =)
So, how about it? Can this patch go in now?
I may have missed it but some benchmark numbers could be good.
Is there no way to make it an option for a while?
that would get good testing AND a fallback for people.
I would also like to see some benchmark numbers. In fact, I had
planned to import ptmalloc in the past; the malloc problem has been
discussed
Here are the results of a benchmark that does 1 KB malloc()/free()
requests from multiple threads on a 14-CPU sparc64 machine. This is a
poor test because sparc64 doesn't have TLS support, which jemalloc
needs in order to perform well. Even so, it shows jemalloc kicking
the pants off of phkmalloc for both single-threaded and
multi-threaded malloc.

phkmalloc:

# ./malloc-test 1024 1000000 1
Starting test with 1 thread...
Thread 2114048 adjusted timing: 27.124817 seconds for 1000000 requests of 1024 bytes.

Starting test with 2 threads...
Thread 2114560 adjusted timing: 67.535854 seconds for 1000000 requests of 1024 bytes.
Thread 2114048 adjusted timing: 70.330298 seconds for 1000000 requests of 1024 bytes.

# ./malloc-test 1024 1000000 3
Starting test with 3 threads...
Thread 2114048 adjusted timing: 74.154855 seconds for 1000000 requests of 1024 bytes.
Thread 2115072 adjusted timing: 74.356363 seconds for 1000000 requests of 1024 bytes.
Thread 2114560 adjusted timing: 77.038550 seconds for 1000000 requests of 1024 bytes.

# ./malloc-test 1024 1000000 4
Starting test with 4 threads...
Thread 2115072 adjusted timing: 217.741657 seconds for 1000000 requests of 1024 bytes.
Thread 2115584 adjusted timing: 228.434310 seconds for 1000000 requests of 1024 bytes.
Thread 2114048 adjusted timing: 228.941544 seconds for 1000000 requests of 1024 bytes.
Thread 2114560 adjusted timing: 229.286134 seconds for 1000000 requests of 1024 bytes.

# ./malloc-test 1024 1000000 5
Starting test with 5 threads...
Thread 2114048 adjusted timing: 770.255000 seconds for 1000000 requests of 1024 bytes.
Thread 2115072 adjusted timing: 770.749431 seconds for 1000000 requests of 1024 bytes.
Thread 2116096 adjusted timing: 771.307654 seconds for 1000000 requests of 1024 bytes.
Thread 2114560 adjusted timing: 772.293253 seconds for 1000000 requests of 1024 bytes.
Thread 2115584 adjusted timing: 772.550847 seconds for 1000000 requests of 1024 bytes.

jemalloc:

# ./malloc-test 1024 1000000 1
Starting test with 1 thread...
Thread -1610612656 adjusted timing: 5.428918 seconds for 1000000 requests of 1024 bytes.

# ./malloc-test 1024 1000000 2
Starting test with 2 threads...
Thread -1610612656 adjusted timing: 4.840497 seconds for 1000000 requests of 1024 bytes.
Thread -1610612176 adjusted timing: 4.948382 seconds for 1000000 requests of 1024 bytes.

# ./malloc-test 1024 1000000 3
Starting test with 3 threads...
Thread -1610611696 adjusted timing: 25.065195 seconds for 1000000 requests of 1024 bytes.
Thread -1610612656 adjusted timing: 25.218103 seconds for 1000000 requests of 1024 bytes.
Thread -1610612176 adjusted timing: 25.286181 seconds for 1000000 requests of 1024 bytes.

# ./malloc-test 1024 1000000 4
Starting test with 4 threads...
Thread -1610612656 adjusted timing: 38.176479 seconds for 1000000 requests of 1024 bytes.
Thread -1610611216 adjusted timing: 38.221169 seconds for 1000000 requests of 1024 bytes.
Thread -1610611696 adjusted timing: 38.294425 seconds for 1000000 requests of 1024 bytes.
Thread -1610612176 adjusted timing: 38.320669 seconds for 1000000 requests of 1024 bytes.

# ./malloc-test 1024 1000000 5
Starting test with 5 threads...
Thread -1610611216 adjusted timing: 50.376766 seconds for 1000000 requests of 1024 bytes.
Thread -1610612656 adjusted timing: 50.435407 seconds for 1000000 requests of 1024 bytes.
Thread -1610611696 adjusted timing: 50.885393 seconds for 1000000 requests of 1024 bytes.
Thread -1610610736 adjusted timing: 50.943412 seconds for 1000000 requests of 1024 bytes.
Thread -1610612176 adjusted timing: 50.953694 seconds for 1000000 requests of 1024 bytes.

i.e., jemalloc is about 5 times faster for single-threaded malloc,
and about 15 times faster than phkmalloc for 5 threads. You can see
the effect of the missing TLS on sparc64 both in the scaling
(performance should be even better with multiple threads) and in the
large performance variation with larger numbers of threads (probably
due to hash collisions):

# ./malloc-test 1024 1000000 20
Starting test with 20 threads...
Thread -1610604016 adjusted timing: 48.297304 seconds for 1000000 requests of 1024 bytes.
Thread -1610604496 adjusted timing: 104.249693 seconds for 1000000 requests of 1024 bytes.
Thread -1610602496 adjusted timing: 109.578616 seconds for 1000000 requests of 1024 bytes.
Thread -1610607856 adjusted timing: 252.337973 seconds for 1000000 requests of 1024 bytes.
Thread -1610610736 adjusted timing: 254.338225 seconds for 1000000 requests of 1024 bytes.
Thread -1610606896 adjusted timing: 255.015353 seconds for 1000000 requests of 1024 bytes.
Thread -1610607376 adjusted timing: 257.463410 seconds for 1000000 requests of 1024 bytes.
Thread -1610609776 adjusted timing: 257.848283 seconds for 1000000 requests of 1024 bytes.
Thread -1610605936 adjusted timing: 257.955005 seconds for 1000000 requests of 1024 bytes.
Thread -1610604976 adjusted timing: 259.303220 seconds for 1000000 requests of 1024 bytes.
Thread -1610611216 adjusted timing: 259.610871 seconds for 1000000 requests of 1024 bytes.
Thread -1610606416 adjusted timing: 260.622687 seconds for 1000000 requests of 1024 bytes.
Thread -1610611696 adjusted timing: 260.857706 seconds for 1000000 requests of 1024 bytes.
Thread -1610610256 adjusted timing: 261.056716 seconds for 1000000 requests of 1024 bytes.
Thread -1610608816 adjusted timing: 261.764455 seconds for 1000000 requests of 1024 bytes.
Thread -1610609296 adjusted timing: 261.800319 seconds for 1000000 requests of 1024 bytes.
Thread -1610605456 adjusted timing: 261.748707 seconds for 1000000 requests of 1024 bytes.
Thread -1610612176 adjusted timing: 262.108598 seconds for 1000000 requests of 1024 bytes.
Thread -1610608336 adjusted timing: 262.119440 seconds for 1000000 requests of 1024 bytes.
Thread -1610612656 adjusted timing: 262.315112 seconds for 1000000 requests of 1024 bytes.

I'll try to test this on a 4 CPU amd64 machine next.

Kris
Kris Kennaway
2005-12-12 04:30:23 UTC
Permalink
Post by Kris Kennaway
I'll try to test this on a 4 CPU amd64 machine next.
phkmalloc:

# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
Thread 5298176 adjusted timing: 4.173052 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
Thread 5299200 adjusted timing: 325.108643 seconds for 10000000 requests of 1024 bytes.
Thread 5298176 adjusted timing: 325.202485 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
Thread 5414912 adjusted timing: 1133.238459 seconds for 10000000 requests of 1024 bytes.
Thread 5299200 adjusted timing: 1134.525255 seconds for 10000000 requests of 1024 bytes.
Thread 5298176 adjusted timing: 1134.539555 seconds for 10000000 requests of 1024 bytes.

jemalloc:

# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
Thread 1073760528 adjusted timing: 3.777175 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
Thread 1073760560 adjusted timing: 3.851702 seconds for 10000000 requests of 1024 bytes.
Thread 1073761584 adjusted timing: 3.887943 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
Thread 1073760528 adjusted timing: 3.866206 seconds for 10000000 requests of 1024 bytes.
Thread 1073761552 adjusted timing: 13.382795 seconds for 10000000 requests of 1024 bytes.
Thread 1073762688 adjusted timing: 14.407229 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
Thread 1073760528 adjusted timing: 3.782923 seconds for 10000000 requests of 1024 bytes.
Thread 1073763792 adjusted timing: 6.668655 seconds for 10000000 requests of 1024 bytes.
Thread 1073762688 adjusted timing: 14.346569 seconds for 10000000 requests of 1024 bytes.
Thread 1073761584 adjusted timing: 14.680211 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
Thread 1073760560 adjusted timing: 4.748248 seconds for 10000000 requests of 1024 bytes.
Thread 1073761584 adjusted timing: 9.898153 seconds for 10000000 requests of 1024 bytes.
Thread 1073764896 adjusted timing: 13.019884 seconds for 10000000 requests of 1024 bytes.
Thread 1073762688 adjusted timing: 15.326908 seconds for 10000000 requests of 1024 bytes.
Thread 1073763792 adjusted timing: 15.442164 seconds for 10000000 requests of 1024 bytes.

So it's 1.1 times faster for single-threaded, and 107 times faster
with 3 threads.

With libthr instead of libpthread:

phkmalloc:

# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
Thread 5255680 adjusted timing: 2.357247 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
Thread 5256192 adjusted timing: 10.964918 seconds for 10000000 requests of 1024 bytes.
Thread 5255680 adjusted timing: 11.001288 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
Thread 5255680 adjusted timing: 17.467754 seconds for 10000000 requests of 1024 bytes.
Thread 5256704 adjusted timing: 17.724583 seconds for 10000000 requests of 1024 bytes.
Thread 5256192 adjusted timing: 17.913381 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
Thread 5255680 adjusted timing: 42.715420 seconds for 10000000 requests of 1024 bytes.
Thread 5256192 adjusted timing: 43.481252 seconds for 10000000 requests of 1024 bytes.
Thread 5256704 adjusted timing: 43.871452 seconds for 10000000 requests of 1024 bytes.
Thread 5257216 adjusted timing: 43.887820 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
Thread 5255680 adjusted timing: 139.316332 seconds for 10000000 requests of 1024 bytes.
Thread 5257216 adjusted timing: 140.117720 seconds for 10000000 requests of 1024 bytes.
Thread 5256192 adjusted timing: 140.134057 seconds for 10000000 requests of 1024 bytes.
Thread 5256704 adjusted timing: 140.855289 seconds for 10000000 requests of 1024 bytes.
Thread 5257728 adjusted timing: 140.865934 seconds for 10000000 requests of 1024 bytes.

jemalloc:

# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
Thread 1073742416 adjusted timing: 1.366353 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
Thread 1073742416 adjusted timing: 1.429485 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 1.530733 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
Thread 1073742416 adjusted timing: 1.419813 seconds for 10000000 requests of 1024 bytes.
Thread 1073743376 adjusted timing: 1.432790 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 1.490218 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
Thread 1073743376 adjusted timing: 1.447554 seconds for 10000000 requests of 1024 bytes.
Thread 1073742416 adjusted timing: 1.503659 seconds for 10000000 requests of 1024 bytes.
Thread 1073743856 adjusted timing: 1.503937 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 1.504926 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
Thread 1073743376 adjusted timing: 1.595239 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 1.689753 seconds for 10000000 requests of 1024 bytes.
Thread 1073742416 adjusted timing: 1.750115 seconds for 10000000 requests of 1024 bytes.
Thread 1073744336 adjusted timing: 1.744271 seconds for 10000000 requests of 1024 bytes.
Thread 1073743856 adjusted timing: 1.890269 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 6
Starting test with 6 threads...
Thread 1073743856 adjusted timing: 1.847653 seconds for 10000000 requests of 1024 bytes.
Thread 1073742416 adjusted timing: 2.018481 seconds for 10000000 requests of 1024 bytes.
Thread 1073743376 adjusted timing: 2.059817 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 2.129204 seconds for 10000000 requests of 1024 bytes.
Thread 1073744336 adjusted timing: 2.223751 seconds for 10000000 requests of 1024 bytes.
Thread 1073744816 adjusted timing: 2.293809 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 20
Starting test with 20 threads...
Thread 1073744816 adjusted timing: 5.113769 seconds for 10000000 requests of 1024 bytes.
Thread 1073751136 adjusted timing: 4.973369 seconds for 10000000 requests of 1024 bytes.
Thread 1073750176 adjusted timing: 5.295912 seconds for 10000000 requests of 1024 bytes.
Thread 1073745296 adjusted timing: 5.502331 seconds for 10000000 requests of 1024 bytes.
Thread 1073743856 adjusted timing: 5.614890 seconds for 10000000 requests of 1024 bytes.
Thread 1073744336 adjusted timing: 5.608690 seconds for 10000000 requests of 1024 bytes.
Thread 1073752096 adjusted timing: 5.555465 seconds for 10000000 requests of 1024 bytes.
Thread 1073748736 adjusted timing: 5.650922 seconds for 10000000 requests of 1024 bytes.
Thread 1073748256 adjusted timing: 6.608054 seconds for 10000000 requests of 1024 bytes.
Thread 1073750656 adjusted timing: 7.144998 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 7.390905 seconds for 10000000 requests of 1024 bytes.
Thread 1073746256 adjusted timing: 7.364728 seconds for 10000000 requests of 1024 bytes.
Thread 1073742416 adjusted timing: 7.556064 seconds for 10000000 requests of 1024 bytes.
Thread 1073749216 adjusted timing: 7.357179 seconds for 10000000 requests of 1024 bytes.
Thread 1073752576 adjusted timing: 7.349483 seconds for 10000000 requests of 1024 bytes.
Thread 1073747776 adjusted timing: 7.375179 seconds for 10000000 requests of 1024 bytes.
Thread 1073751616 adjusted timing: 7.557854 seconds for 10000000 requests of 1024 bytes.
Thread 1073743376 adjusted timing: 7.915978 seconds for 10000000 requests of 1024 bytes.
Thread 1073749696 adjusted timing: 7.795219 seconds for 10000000 requests of 1024 bytes.
Thread 1073745776 adjusted timing: 8.007392 seconds for 10000000 requests of 1024 bytes.

So libthr is *much* faster than libpthread with both malloc
implementations, but jemalloc is still 1.7 times faster for 1 thread
and 80 times faster for 5 threads than phkmalloc.

Kris

P.S. Holy crap :)
Johan Bucht
2005-12-12 05:04:15 UTC
Permalink
Post by Kris Kennaway
Post by Kris Kennaway
I'll try to test this on a 4 CPU amd64 machine next.
Thanks for the time.
Post by Kris Kennaway
# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
Thread 5298176 adjusted timing: 4.173052 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
Thread 5299200 adjusted timing: 325.108643 seconds for 10000000 requests of 1024 bytes.
Thread 5298176 adjusted timing: 325.202485 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
Thread 5414912 adjusted timing: 1133.238459 seconds for 10000000 requests of 1024 bytes.
Thread 5299200 adjusted timing: 1134.525255 seconds for 10000000 requests of 1024 bytes.
Thread 5298176 adjusted timing: 1134.539555 seconds for 10000000 requests of 1024 bytes.
Those times seem way too high even for a serial allocator; is
libpthread performance really this bad on amd64, or is it broken?
Post by Kris Kennaway
# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
Thread 1073760528 adjusted timing: 3.777175 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
Thread 1073760560 adjusted timing: 3.851702 seconds for 10000000 requests of 1024 bytes.
Thread 1073761584 adjusted timing: 3.887943 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
Thread 1073760528 adjusted timing: 3.866206 seconds for 10000000 requests of 1024 bytes.
Thread 1073761552 adjusted timing: 13.382795 seconds for 10000000 requests of 1024 bytes.
Thread 1073762688 adjusted timing: 14.407229 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
Thread 1073760528 adjusted timing: 3.782923 seconds for 10000000 requests of 1024 bytes.
Thread 1073763792 adjusted timing: 6.668655 seconds for 10000000 requests of 1024 bytes.
Thread 1073762688 adjusted timing: 14.346569 seconds for 10000000 requests of 1024 bytes.
Thread 1073761584 adjusted timing: 14.680211 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
Thread 1073760560 adjusted timing: 4.748248 seconds for 10000000 requests of 1024 bytes.
Thread 1073761584 adjusted timing: 9.898153 seconds for 10000000 requests of 1024 bytes.
Thread 1073764896 adjusted timing: 13.019884 seconds for 10000000 requests of 1024 bytes.
Thread 1073762688 adjusted timing: 15.326908 seconds for 10000000 requests of 1024 bytes.
Thread 1073763792 adjusted timing: 15.442164 seconds for 10000000 requests of 1024 bytes.
So it's 1.1 times faster for single-threaded, and 107 times faster
with 3 threads.
The thread-scheduling problem under 4BSD reported earlier, where the
first thread gets higher priority than later threads, makes these
numbers a bit strange.
Post by Kris Kennaway
# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
Thread 5255680 adjusted timing: 2.357247 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
Thread 5256192 adjusted timing: 10.964918 seconds for 10000000 requests of 1024 bytes.
Thread 5255680 adjusted timing: 11.001288 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
Thread 5255680 adjusted timing: 17.467754 seconds for 10000000 requests of 1024 bytes.
Thread 5256704 adjusted timing: 17.724583 seconds for 10000000 requests of 1024 bytes.
Thread 5256192 adjusted timing: 17.913381 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
Thread 5255680 adjusted timing: 42.715420 seconds for 10000000 requests of 1024 bytes.
Thread 5256192 adjusted timing: 43.481252 seconds for 10000000 requests of 1024 bytes.
Thread 5256704 adjusted timing: 43.871452 seconds for 10000000 requests of 1024 bytes.
Thread 5257216 adjusted timing: 43.887820 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
Thread 5255680 adjusted timing: 139.316332 seconds for 10000000 requests of 1024 bytes.
Thread 5257216 adjusted timing: 140.117720 seconds for 10000000 requests of 1024 bytes.
Thread 5256192 adjusted timing: 140.134057 seconds for 10000000 requests of 1024 bytes.
Thread 5256704 adjusted timing: 140.855289 seconds for 10000000 requests of 1024 bytes.
Thread 5257728 adjusted timing: 140.865934 seconds for 10000000 requests of 1024 bytes.
Looks reasonable for a serial allocator.
Post by Kris Kennaway
# ./malloc-test 1024 10000000 1
Starting test with 1 thread...
Thread 1073742416 adjusted timing: 1.366353 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 2
Starting test with 2 threads...
Thread 1073742416 adjusted timing: 1.429485 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 1.530733 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 3
Starting test with 3 threads...
Thread 1073742416 adjusted timing: 1.419813 seconds for 10000000 requests of 1024 bytes.
Thread 1073743376 adjusted timing: 1.432790 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 1.490218 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 4
Starting test with 4 threads...
Thread 1073743376 adjusted timing: 1.447554 seconds for 10000000 requests of 1024 bytes.
Thread 1073742416 adjusted timing: 1.503659 seconds for 10000000 requests of 1024 bytes.
Thread 1073743856 adjusted timing: 1.503937 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 1.504926 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 5
Starting test with 5 threads...
Thread 1073743376 adjusted timing: 1.595239 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 1.689753 seconds for 10000000 requests of 1024 bytes.
Thread 1073742416 adjusted timing: 1.750115 seconds for 10000000 requests of 1024 bytes.
Thread 1073744336 adjusted timing: 1.744271 seconds for 10000000 requests of 1024 bytes.
Thread 1073743856 adjusted timing: 1.890269 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 6
Starting test with 6 threads...
Thread 1073743856 adjusted timing: 1.847653 seconds for 10000000 requests of 1024 bytes.
Thread 1073742416 adjusted timing: 2.018481 seconds for 10000000 requests of 1024 bytes.
Thread 1073743376 adjusted timing: 2.059817 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 2.129204 seconds for 10000000 requests of 1024 bytes.
Thread 1073744336 adjusted timing: 2.223751 seconds for 10000000 requests of 1024 bytes.
Thread 1073744816 adjusted timing: 2.293809 seconds for 10000000 requests of 1024 bytes.
# ./malloc-test 1024 10000000 20
Starting test with 20 threads...
Thread 1073744816 adjusted timing: 5.113769 seconds for 10000000 requests of 1024 bytes.
Thread 1073751136 adjusted timing: 4.973369 seconds for 10000000 requests of 1024 bytes.
Thread 1073750176 adjusted timing: 5.295912 seconds for 10000000 requests of 1024 bytes.
Thread 1073745296 adjusted timing: 5.502331 seconds for 10000000 requests of 1024 bytes.
Thread 1073743856 adjusted timing: 5.614890 seconds for 10000000 requests of 1024 bytes.
Thread 1073744336 adjusted timing: 5.608690 seconds for 10000000 requests of 1024 bytes.
Thread 1073752096 adjusted timing: 5.555465 seconds for 10000000 requests of 1024 bytes.
Thread 1073748736 adjusted timing: 5.650922 seconds for 10000000 requests of 1024 bytes.
Thread 1073748256 adjusted timing: 6.608054 seconds for 10000000 requests of 1024 bytes.
Thread 1073750656 adjusted timing: 7.144998 seconds for 10000000 requests of 1024 bytes.
Thread 1073742896 adjusted timing: 7.390905 seconds for 10000000 requests of 1024 bytes.
Thread 1073746256 adjusted timing: 7.364728 seconds for 10000000 requests of 1024 bytes.
Thread 1073742416 adjusted timing: 7.556064 seconds for 10000000 requests of 1024 bytes.
Thread 1073749216 adjusted timing: 7.357179 seconds for 10000000 requests of 1024 bytes.
Thread 1073752576 adjusted timing: 7.349483 seconds for 10000000 requests of 1024 bytes.
Thread 1073747776 adjusted timing: 7.375179 seconds for 10000000 requests of 1024 bytes.
Thread 1073751616 adjusted timing: 7.557854 seconds for 10000000 requests of 1024 bytes.
Thread 1073743376 adjusted timing: 7.915978 seconds for 10000000 requests of 1024 bytes.
Thread 1073749696 adjusted timing: 7.795219 seconds for 10000000 requests of 1024 bytes.
Thread 1073745776 adjusted timing: 8.007392 seconds for 10000000 requests of 1024 bytes.
Seems to experience the same scheduling issues but to a lesser extent.
Post by Kris Kennaway
So libthr is *much* faster than libpthread with both malloc
implementations, but jemalloc is still 1.7 times faster for 1 thread
and 80 times faster for 5 threads than phkmalloc.
This test only exercises local-arena performance, making it the worst
case for a serial allocator, since all threads contend for the same
lock. At the same time it is the best-case scenario for jemalloc, since
all memory resides in the local arena. That means no contention at all,
unless threads get hashed into the same arena.
Basically you are comparing the worst case of phkmalloc against the
best case of jemalloc. =)
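The thread-to-arena hashing mentioned here can be sketched in a few lines. The hash function and multiplier below are illustrative assumptions, not jemalloc's actual code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch of hashing a thread ID to one of N arenas.  Two
 * threads whose IDs hash to the same index share an arena and contend
 * for its lock; the multiplier is an arbitrary Knuth-style constant,
 * not taken from jemalloc.
 */
static unsigned
arena_index(uintptr_t thread_id, unsigned narenas)
{
    return (unsigned)((thread_id * 2654435761U) % narenas);
}
```

With many more threads than arenas, collisions (and thus contention) are unavoidable by the pigeonhole principle, which is why the benchmark above degrades only gradually as the thread count grows.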

Would be nice if someone could run some supersmack benchmarks.
Post by Kris Kennaway
Kris
P.S. Holy crap :)
Kris Kennaway
2005-12-12 05:58:04 UTC
Permalink
Post by Johan Bucht
Those times seem way too high even for a serial allocator; is
libpthread performance really this bad on amd64, or is it broken?
Not sure which thread library has been tested.
The quoted text referred to libpthread, the default on amd64.

Kris
David Xu
2005-12-12 07:18:01 UTC
Permalink
Post by Kris Kennaway
Post by Johan Bucht
Those times seem way too high even for a serial allocator; is
libpthread performance really this bad on amd64, or is it broken?
Not sure which thread library has been tested.
The quoted text referred to libpthread, the default on amd64.
Kris
May try libthr too ? :-)
Kris Kennaway
2005-12-12 07:25:54 UTC
Permalink
Post by David Xu
Post by Kris Kennaway
Post by Johan Bucht
Those times seem way too high even for a serial allocator; is
libpthread performance really this bad on amd64, or is it broken?
Not sure which thread library has been tested.
The quoted text referred to libpthread, the default on amd64.
Kris
May try libthr too ? :-)
That was in the other half of the mail, I guess you didn't read it
thoroughly :)

Kris
David Xu
2005-12-12 05:56:09 UTC
Permalink
Post by Johan Bucht
Those times seem way too high even for a serial allocator; is
libpthread performance really this bad on amd64, or is it broken?
Not sure which thread library has been tested.
Jason Evans
2005-12-12 01:47:57 UTC
Permalink
Post by Julian Elischer
I may have missed it but some benchmark numbers could be good.
I haven't posted any benchmark numbers, but that is a reasonable
request. Here's a summary of what I've seen so far.

For single-threaded apps, phkmalloc and jemalloc exhibit very similar
performance for all of the benchmarks I've run. Neither is a clear
winner over the other from what I've seen.

Kris Kennaway already posted some multi-threaded microbenchmark
results. My tests have yielded similar results to his.

It would be very informative to run benchmarks with real world
multithreaded apps. bind9 (built with threading support) would be a
great candidate, but thus far I haven't gotten a chance to use the
machines that Robert Watson uses for such tests.
Post by Julian Elischer
Is there no way to make it an option for a while?
that would get good testing AND a fallback for people.
Unfortunately, there are some low level issues that make the two
malloc implementations incompatible, and they both need access to
libc internals in order to work correctly in a multi-threaded
program. The way I have been comparing the two implementations is
via chroot installations. It might be possible to make the two
compatible (would require extra coding), but since both of them need
to be part of libc, we would need a way of building separate libc
libraries for the two mallocs. This all seems uglier than it's worth
to me. Maybe there's another way...

Jason
Dan Nelson
2005-12-12 02:03:01 UTC
Permalink
Post by Jason Evans
Is there no way to make it an option for a while? that would get
good testing AND a fallback for people.
Unfortunately, there are some low level issues that make the two
malloc implementations incompatible, and they both need access to
libc internals in order to work correctly in a multi-threaded
program. The way I have been comparing the two implementations is
via chroot installations. It might be possible to make the two
compatible (would require extra coding), but since both of them need
to be part of libc, we would need a way of building separate libc
libraries for the two mallocs. This all seems uglier than it's worth
to me. Maybe there's another way...
I have had good results by simply compiling malloc.c into a shared
library and loading it via LD_PRELOAD. It works well enough to run
mozilla, at least.
--
Dan Nelson
***@allantgroup.com
Doug Barton
2005-12-12 02:56:00 UTC
Permalink
Post by Jason Evans
It would be very informative to run benchmarks with real world
multithreaded apps. bind9 (built with threading support) would be a
great candidate,
Maybe someday, but not at the moment. I've been told by the folks at ISC
that in its current form, threading is a pessimization in all versions of
BIND 9 prior to the 9.4 code they have in CVS (which has not been
generally released). Thus, I would not expect the performance of BIND 9
with threads to be an accurate reflection of the effectiveness of the library.

hth,

Doug
--
If you're never wrong, you're not trying hard enough
Peter Jeremy
2005-12-12 07:02:31 UTC
Permalink
Post by Jason Evans
early, during the load of the libbitmap module. My best guess is
that it is stuck trying to acquire the Xlib lock, but cannot be sure,
since I don't know how to get debug symbols for the loaded X module.
You could compile all the modules into the server by adding "#define
DoLoadableServer NO" to your host.conf when you compile XFree86-4-Server
(see scripts/configure for other defines that are added).
Something similar probably works in X.org. I found that I had to
remove elo2300 from XInputDrivers to correct an unresolved 'ELO2300'.
--
Peter Jeremy
Eric Anholt
2005-12-12 07:56:01 UTC
Permalink
Post by Peter Jeremy
Post by Jason Evans
early, during the load of the libbitmap module. My best guess is
that it is stuck trying to acquire the Xlib lock, but cannot be sure,
since I don't know how to get debug symbols for the loaded X module.
You could compile all the modules into the server by adding "#define
DoLoadableServer NO" to your host.conf when you compile XFree86-4-Server
(see scripts/configure for other defines that are added).
Something similar probably works in X.org. I found that I had to
remove elo2300 from XInputDrivers to correct an unresolved 'ELO2300'.
DoLoadableServer NO may or may not work at any time, and frequently
changes bug reproducibility due to the lack of the loader process. If
you can reproduce with xorg-server-snap, then WITH_DEBUG=1 during
compile will get you a server with debug symbols.
--
Eric Anholt ***@lclark.edu
http://people.freebsd.org/~anholt/ ***@FreeBSD.org
Claus Guttesen
2005-11-30 09:02:14 UTC
Permalink
Post by Jason Evans
http://www.canonware.com/~jasone/jemalloc/jemalloc_20051127a.diff
This implementation is very different from the current libc malloc.
Probably the most important difference is that this one is designed
with threads and SMP in mind.
Do you need current for this? I patched and tried buildworld on 6.0
stable but no go.

regards
Claus
Jason Evans
2005-11-30 22:47:32 UTC
Permalink
Post by Claus Guttesen
Post by Jason Evans
http://www.canonware.com/~jasone/jemalloc/jemalloc_20051127a.diff
Do you need current for this? I patched and tried buildworld on 6.0
stable but no go.
I started work on this before 6.0 branched, and am unaware of any
changes that would impact the patch. However, I've only used current
for the development.

Jason
Ulrich Spoerlein
2005-11-30 11:10:17 UTC
Permalink
* The virtual memory size of processes, as reported in the SIZE field by top, will appear
astronomical for almost all processes (32+ MB). This is expected; it is merely an artifact
of using large mmap()ed regions rather than sbrk().
Hi,

I just read that mmap() part and have to wonder: Is it possible to
introduce something like the guard pages that OpenBSD has implemented?
I'd love to try this out and see the dozens of applications that fail
due to off-by-one bugs.

If the security features of OpenBSDs new malloc() could be implemented
as new MALLOC_OPTIONS directives, that would be fantastic!

Ulrich Spoerlein
--
PGP Key ID: F0DB9F44 Encrypted mail welcome!
Fingerprint: F1CE D062 0CA9 ADE3 349B 2FE8 980A C6B5 F0DB 9F44
Ok, which part of "Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn."
didn't you understand?
Poul-Henning Kamp
2005-11-30 11:18:26 UTC
Permalink
Post by Ulrich Spoerlein
I just read that mmap() part and have to wonder: Is it possible to
introduce something like the guard pages that OpenBSD has implemented?
I'd love to try this out and see the dozens of applications that fail
due to off-by-one bugs.
Guard-pages are very expensive and that is why I have not adopted
OpenBSD's patch.

I would advocate that people use one of the dedicated debugging malloc
implementations (ElectricFence ?) instead of putting too much overhead
into our default malloc.

For all practical purposes, the options J, A, X & Z are the most commonly
used.
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Daniel O'Connor
2005-11-30 12:21:52 UTC
Permalink
Post by Poul-Henning Kamp
Post by Ulrich Spoerlein
I just read that mmap() part and have to wonder: Is it possible to
introduce something like the guard pages that OpenBSD has implemented?
I'd love to try this out and see the dozens of applications that fail
due to off-by-one bugs.
Guard-pages are very expensive and that is why I have not adopted
OpenBSD's patch.
I would advocate that people use one of the dedicated debugging malloc
implementations (ElectricFence ?) instead of putting too much overhead
into our default malloc.
ElectricFence is a good fit, although it IS slow: usually an order of
magnitude or more. Also, if you do use it, you'll probably have to bump
up the vm.max_proc_mmap sysctl or it will fail to allocate memory.

Another good one is valgrind (and it detects more problems to boot :)
Post by Poul-Henning Kamp
For all practical purposes, the options J, A, X & Z are the most commonly
used.
--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
-- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C
Ulrich Spoerlein
2005-11-30 12:30:19 UTC
Permalink
Post by Daniel O'Connor
Post by Poul-Henning Kamp
Post by Ulrich Spoerlein
I just read that mmap() part and have to wonder: Is it possible to
introduce something like the guard pages that OpenBSD has implemented?
I'd love to try this out and see the dozens of applications that fail
due to off-by-one bugs.
Guard-pages are very expensive and that is why I have not adopted
OpenBSD's patch.
Yes, of course it should be disabled by default, but if it could be
implemented so that you can switch at runtime or compile time (think
INVARIANTS/WITNESS) *and* there is no penalty in the disabled case,
that would be nice.
Post by Daniel O'Connor
Post by Poul-Henning Kamp
I would advocate that people use one of the dedicated debugging malloc
implementations (ElectricFence ?) instead of putting too much overhead
into our default malloc.
ElectricFence is a good fit, although it IS slow: usually an order of
magnitude or more. Also, if you do use it, you'll probably have to bump
up the vm.max_proc_mmap sysctl or it will fail to allocate memory.
Another good one is valgrind (and it detects more problems to boot :)
Yes, I usually use dmalloc and valgrind. It's sad that other developers
don't use any of these tools...

Ulrich Spoerlein
--
PGP Key ID: F0DB9F44 Encrypted mail welcome!
Fingerprint: F1CE D062 0CA9 ADE3 349B 2FE8 980A C6B5 F0DB 9F44
Ok, which part of "Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn."
didn't you understand?
Jason Evans
2005-11-30 14:32:54 UTC
Permalink
Post by Ulrich Spoerlein
Post by Poul-Henning Kamp
Post by Ulrich Spoerlein
I just read that mmap() part and have to wonder: Is it possible to
introduce something like the guard pages that OpenBSD has
implemented?
I'd love to try this out and see the dozens of applications that fail
due to off-by-one bugs.
Guard-pages are very expensive and that is why I have not adopted
OpenBSD's patch.
Yes, of course it should be disabled as default, but if it could be
implemented so you can switch at runtime or compile time (think
INVARIANTS/WITNESS) *and* there is no penalty for the disabled case,
that be nice.
In a previous version of the patch, I included compile-time support
for redzones around allocations. Kris Kennaway did a full ports tree
build with redzones enabled, and several ports caused redzone
corruption, but in every case it was due to writing one byte past the
end of an allocation. None of these were serious, since word
alignment required that the "corrupted" byte be unused. I suspect
that we would catch very few serious errors, even if redzones were
enabled by default.
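The redzone checking described here can be illustrated with a minimal sketch. This is not the dropped jemalloc implementation; the magic value and one-byte zone width are assumptions, chosen only to show how a one-byte overrun past the end of an allocation gets caught:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Minimal redzone sketch: write a magic byte just past each allocation
 * and verify it at free time.  A single overwritten byte, like the
 * off-by-one bugs the ports build turned up, flips the check.
 */
#define REDZONE_BYTE 0xA5

static void *
rz_malloc(size_t size)
{
    unsigned char *p = malloc(size + 1);  /* one extra byte for the zone */
    if (p != NULL)
        p[size] = REDZONE_BYTE;
    return p;
}

/* Returns 0 if the redzone is intact, nonzero if it was overwritten. */
static int
rz_check_and_free(void *ptr, size_t size)
{
    unsigned char *p = ptr;
    int corrupt = (p[size] != REDZONE_BYTE);
    free(p);
    return corrupt;
}
```

Note the bloat Jason describes: every allocation grows by at least one byte, and keeping alignment guarantees in practice costs a full word or more per object.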

Due to some unrelated performance issues, I later did a significant
rework of the internal data structures, and decided to drop redzone
support since the new data structures weren't as conducive to
redzones. Ultimately, I don't think we would have wanted to leave
this feature enabled, even for CURRENT, because it required that all
allocations be larger, thus bloating memory usage for all applications.

As a runtime-switchable feature, I think we still wouldn't want to
leave it compiled in for production systems. I spent a lot of time
looking at valgrind (cachegrind tool) profiles, and found that even
innocuous additional features, such as tracking total allocated
memory, had significant negative impacts on performance. The feature
that I really didn't want to remove, which is also important for
redzone support, was byte-exact tracking of allocation size. The
extra branches that would be required for runtime support of redzones
probably wouldn't be worth the cost.

Jason
Andrey Chernov
2005-11-30 14:39:59 UTC
Permalink
Post by Jason Evans
In a previous version of the patch, I included compile-time support
for redzones around allocations. Kris Kennaway did a full ports tree
build with redzones enabled, and several ports caused redzone
corruption, but in every case it was due to writing one byte past the
end of an allocation. None of these were serious, since word
alignment required that the "corrupted" byte be unused. I suspect
that we would catch very few serious errors, even if redzones were
enabled by default.
You can make red zones word-aligned in addition to the byte-aligned
variant, both as malloc options, of course.
--
http://ache.pp.ru/
Ulrich Spoerlein
2005-11-30 14:47:09 UTC
Permalink
[Why no redzones in jemalloc]
Thanks for the elaborate explanation. Greatly appreciated.

Ulrich Spoerlein
--
PGP Key ID: F0DB9F44 Encrypted mail welcome!
Fingerprint: F1CE D062 0CA9 ADE3 349B 2FE8 980A C6B5 F0DB 9F44
Ok, which part of "Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn."
didn't you understand?
Daniel Eischen
2005-12-04 14:29:37 UTC
Permalink
Post by David Xu
I just need simple (low overhead) mutexes that don't cause malloc to be
called during their initialization.
umtx is lightweight and fast, and does not need malloc.
I would have used pthread_mutex_*
directly, but cannot due to infinite recursion problems during
initialization.
Yes, I know current pthread_mutex implementations use malloc,
I don't think it will be changed to avoid using malloc very soon.
It's on my list of things to do.
Post by David Xu
As you pointed out, it's important to get priority inheritance right in
order to avoid priority inversion deadlock, so my hand-rolled spinlocks
weren't adequate. I need mutexes that are managed by the threads
library. The libc spinlocks appear to fit the bill perfectly in that
capacity. It seems to me that using umtx would actually be the wrong
thing to do, because I'd be circumventing libpthread's userland
scheduler, and it would be the wrong thing for libc_r, as you pointed
out. This approach would work for libthr, but perhaps nothing else?
umtx will work with libpthread, I can not find any reason why using umtx
will cause deadlock, the userland scheduler can not propagate its
priority decision cross kernel, and umtx is a blockable syscall.
The problem is userland code can exit, circumvent the unlock by
exception handling, take a signal and longjmp, etc., which may
leave locks (not known by libpthread) held. At least with
spinlocks or mutex, the thread libraries can know that the
application is in a critical region and can behave accordingly.
Libpthread will defer switching threads when they are in
critical regions (unless they are blocked).

I think that libc or other libraries that want to be thread-safe
shouldn't try to roll their own locks. The reason to do so is
that lock overhead may be deemed too great. If that is the
case, then we should fix the problem at its source ;-)
Of course, the other reason is that mutexes currently have to
be allocated.
--
DE
Johan Bucht
2005-12-12 01:48:52 UTC
Permalink
Ugh, I forgot that I used a different SMTP server from the one
subscribed to the list, so the mail shouldn't have made it to the list;
if it has, I apologize. I also remembered one more thing, about locking
granularity.

Just a few comments; not sure if all of them are valid, since I don't
have a recent enough system to apply the patch and test it, so my
comments are just based on reading the patch. I might have missed
something hidden among all the differing lines. Can you put up a
patched malloc.c somewhere?

I made a proof-of-concept parallel allocator for my Master's Thesis
during the spring and some of the comments are about differences and
similarities between our implementations.

* Fork safety functions
Nice to have for all allocators, and something I missed having. Would
like to have them regardless of whether your malloc becomes standard or not.

* magazines, per-arena caches
If I didn't miss something, you don't seem to have them; pardon if I
missed them in my quick reading. They can really help scalability, and
are implemented in the Solaris libumem allocator with great success. I
implemented them and they helped bring scalability up quite a bit on
the system I tested my allocator on, as well as bringing allocation
latency down.

* Size lookup at free
Saw you used a tree to look up allocation sizes at free; this differs
from how I and Solaris libumem do it. Instead I went for prepending an
8/16-byte tag containing the size of the allocation, for lockless size
lookup.
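The prepended-tag approach can be sketched as follows. This is an assumption-level illustration of the idea, not libumem's actual code; the function names are invented for the example:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Size-tag sketch: each allocation is grown by one tag word holding
 * its size, so free() and friends can recover the size from the
 * pointer itself, with no shared tree and no lock.
 */
#define TAG_SIZE sizeof(size_t)

static void *
tag_malloc(size_t size)
{
    size_t *raw = malloc(TAG_SIZE + size);
    if (raw == NULL)
        return NULL;
    raw[0] = size;             /* store the size just below the pointer */
    return (void *)(raw + 1);  /* hand out the memory after the tag */
}

static size_t
tag_usable_size(void *ptr)
{
    return ((size_t *)ptr)[-1];  /* lockless size lookup */
}

static void
tag_free(void *ptr)
{
    if (ptr != NULL)
        free((size_t *)ptr - 1);
}
```

The trade-off against the tree is the per-allocation space overhead, plus the page-waste effect for page-multiple requests that Johan describes later in this mail.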

* Bucket sizes
If I didn't miss something, you are using power-of-two sizes. It might
be better to use smaller size classes, to avoid internal fragmentation
as well as to improve locking granularity.

* Locking granularity
You use a single lock for chunk allocation instead of a lock per
bucket size or something similar. This limits access unless threads
allocate from their private arenas. Since you hash to pick an arena,
multiple threads may access the same arena at the same time,
introducing contention. It might be better to have a lock per
allocation size to improve the situation somewhat.

* Locking primitive
The biggest issue, as David Xu pointed out, is probably the locking
primitives. The SPINLOCK use has a limit in the threading library and
makes it hard to have a lot of mutexes. I ended up using a wrapper
around the umtx_lock function to get recursive mutexes, and it would
probably be better to extend the umtx functions to handle recursion.
This would probably also be appreciated by other malloc implementations.
It might be interesting to implement some of the ideas from the Linux
futex implementation to help umtx.

* sbrk vs mmap
I see you added the ability to try brk first and fall back to mmap if
brk fails. This is similar to my implementation, as well as to ptmalloc.
Looks good.

* Big allocations
It might be preferable to use mmap/munmap directly for big allocations
(bigger than a few pages) to avoid the overhead of trying to be smart.
These big allocations should be pretty few, and shouldn't be pessimized
much by this. It also means some memory savings, as you can free memory
from big allocations directly.

* Ownership
Didn't look that hard for it, so I might have missed it. Basically,
what happens if you free memory that was allocated in another arena:
do you return the memory to the arena it was allocated from, or to the
current thread's arena? Not returning it to the original arena leads
to cache-line sharing. Returning it adds some overhead, and it might
be argued that applications doing this shouldn't be considered
scalable and should be fixed instead.

* Standalone version
Would be nice to have a standalone version to test the allocator on
Solaris and Linux, to compare against their allocators, and for all the
people with access only to SMP machines running another OS. I tested my
allocator on both Solaris 9 and FreeBSD 5/6, which was helpful both in
testing scalability and the impact on small machines, as well as in
finding bugs in the implementation. I tested my allocator against
Solaris libc, Hoard and libumem on a 4-CPU Solaris 9 machine (4x
UltraSPARC IIIi, 1600MHz, 16384MB RAM) at the university, and tested it
against the FreeBSD libc on my 400MHz laptop with too much beer and an
almost non-working CPU fan. Beer as in standing in 10cm of beer in my
backpack =).
Not sure how much of the implementation is FreeBSD-dependent; the
RB-trees? Solaris 9 has the benefit of using a local malloc inside
libc, which makes the pthread library functions accessible from the
malloc implementation. I avoided touching the libc malloc, since I
really didn't care about single-threaded apps, and simply used
LD_PRELOAD to override it. Basically I throw away memory if you
allocate a multiple of the page size, since I prepend a tag to the
memory and allocate whole pages. So an 8192-byte allocation becomes an
8200-byte allocation, wasting a page.
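The waste described here is just page arithmetic, and can be checked with a few lines. The sizes below assume 4 KB pages and an 8-byte tag, matching the 8192 -> 8200 example:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Worked example of the tag-induced waste: an 8-byte tag pushes a
 * page-multiple request just over a page boundary, costing one whole
 * extra page of backing store.
 */
#define PAGE 4096u
#define TAG  8u

static size_t
pages_needed(size_t request)
{
    size_t total = request + TAG;      /* 8192 -> 8200 bytes */
    return (total + PAGE - 1) / PAGE;  /* round up to whole pages */
}
```

So an 8192-byte request occupies three pages (12288 bytes) instead of two, which is the page of waste the mail refers to.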

* Focus - Single thread vs multithread
What is the focus of the allocator, and how much memory is it worth
wasting to get a good, scalable implementation? This is the reason I
think Solaris still keeps a serial allocator in libc and makes libumem
accessible through LD_PRELOAD and friends.

* Benchmarks
Would be really nice with a few benchmarks against phkmalloc, ptmalloc,
hoard and libumem to see how well it stands. Pure local arena allocation
apps, Larson benchmarks and some other. Might also be good to see how
much more memory it consumes compared to phkmalloc.

* Comment
Might be wise to fix the comment about always using mmap since brk can
be used.

* Conclusion
Despite not having tested it, it looks good, and it would be nice to
have it in FreeBSD. Some parts are really independent of the allocator
and would be nice to have regardless of the general consensus.
--
/Johan Bucht
***@acc.umu.se
Jason Evans
2005-12-12 02:27:14 UTC
Permalink
Johan,

Thank you for your code review! Specific responses follow.
Post by Johan Bucht
Just a few comments, not sure if all of them are valid since I don't
have a recent enough system to apply the patch and test it, so my
comments are just based on reading the patch. I might have missed
something hidden between all the lines differing. Can you put up a
patched malloc.c somewhere?
http://www.canonware.com/~jasone/jemalloc/malloc.c
Post by Johan Bucht
* Fork safety functions
Nice to have for all allocators and is something I missed having. Would
like to have them regardless if your malloc becomes standard or not.
I think that the implementation is currently fork-safe. The threads
libraries call _malloc_prefork() and _malloc_postfork() in order to
avoid locking issues.
Post by Johan Bucht
* magazines, per-arena caches
If I didn't miss something you don't seem to have it, pardon if I missed
them in my quick readings. They can really help the scalability and is
something that is implemented in Solaris libumem allocator with great
success. I implemented them and they helped bring the scalability up
quite a bit on the system I tested my allocator on aswell as bringing
the allocation latency down.
Arenas do use caches for objects below a size threshold (currently
~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).
The caching mechanism is quite different from magazines, but still
very effective.
Post by Johan Bucht
* Saw you used a tree to look up the allocation sizes at free; this
differs from how I and Solaris libumem do it. Instead I went for
prepending an 8/16 byte tag containing the size of the allocation for
lockless size lookup.
Actually, my implementation prepends a 4 byte tag that contains
allocated/free flags, and the size of the allocation. The trees are
only used to look up the sizes of huge allocations (> 8 MB on 32-bit
platforms, and > 256 MB on 64-bit platforms). Huge allocations start
at chunk boundaries, so prepending a tag would require an entire
extra page (and would cause potentially serious external
fragmentation on 32-bit systems).
Post by Johan Bucht
* Bucket sizes
If I didn't miss something you are using power of two sizes. Might be
better to use smaller sizes to avoid internal fragmentation, as well as
improving locking granularity.
My implementation uses quantum-sized buckets (not power of two size
classes). The quantum size is either 8 or 16 bytes, depending on the
machine architecture.
Post by Johan Bucht
* Locking granularity
You use a single lock for the chunk allocation instead of a lock per
bucket size or something similar. This means that you limit access
unless threads allocate from their private arenas. Since you hash to get
an arena there might be multiple threads accessing the same arena at the
same time introducing contention. Might be better to have a lock per
allocation size to somewhat improve the situation.
Using a lock per allocation size requires that slower algorithms be
used in some places (and it certainly complicates certain other
operations). This didn't seem like a worthwhile tradeoff to me. By
making the code faster and simpler, less time is spent in the malloc
code. Unless an app does nothing but malloc/free, I don't expect
this to be an issue. Even microbenchmarks that do nothing but malloc/
free don't appear to suffer.
Post by Johan Bucht
* Locking primitive
The biggest issue and as David Xu pointed out is probably the locking
primitives. The SPINLOCK use has a limit in the threading library and
makes it hard to have a lot of mutexes. I ended up using a wrapper
around the umtx_lock function to get recursive mutexes and it would
probably be better to extend the umtx functions to handle recursion.
This would probably also be appreciated by other malloc
implementations.
Might be interesting to implement some of the ideas from the Linux futex
implementation to help umtx.
I have been contemplating creating a separate spinlock API that
doesn't require the threads library to track the spinlocks across
fork. This would (if I understand correctly) remove the current
static spinlock limitations.

As for supporting recursive spinlocks, I doubt that the overhead
would be acceptable in general. If I could get rid of the need for
the one recursive lock in malloc.c, I certainly would. =)
Post by Johan Bucht
* Big allocations
It might be preferable to use mmap/munmap directly for big allocations
(bigger than a few pages) to avoid the overhead of trying to be smart.
These big allocations should be pretty few and shouldn't be that
pessimized by this. It also means some memory savings as you can
directly free memory from big allocations.
I don't really see the approach I'm taking as "trying to be smart",
so much as "making it simple by using the same algorithm for almost
everything". Many other malloc implementations I've examined use
three or more allocation approaches, depending on the allocation
size. jemalloc uses only two, and huge allocations really are huge,
such that they rarely if ever happen. In practice, I don't think
there is a real memory savings to be had. Besides, madvise() calls
in strategic locations would achieve approximately the same result.
Post by Johan Bucht
* Ownership
I didn't look very hard for this, so I might have missed it. Basically,
what happens if you free memory that was allocated in another arena: do
you return it to the arena it was allocated from, or to the current
thread's arena? Not returning it to the original arena leads to cache
line sharing. Returning it to the original adds some overhead, and it
might be argued that applications doing this are not scalable anyway
and should be fixed instead.
Allocated objects can be associated with their parent arenas in
constant time (by masking bits in the object addresses). This makes
it possible to always free memory back to the same arena it came from.
Post by Johan Bucht
* Standalone version
It would be nice to have a standalone version, both to test the
allocator on Solaris and Linux against their allocators, and for all
the people who only have access to SMP machines running another OS. I
tested my allocator on both Solaris 9 and FreeBSD 5/6, and it was
helpful both in testing scalability and impact on small machines as
well as in finding bugs in the implementation. I tested my allocator
against the Solaris libc, Hoard and libumem on a 4-CPU Solaris 9
machine (4x UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university,
and tested it against the FreeBSD libc on my 400MHz laptop with too
much beer and an almost non-working CPU fan. Beer as in standing in
10cm of beer in my backpack =).
Not sure how much of the implementation is FreeBSD-dependent (the
RB-trees?).
I actually have a separate more portable implementation of jemalloc
that is part of a programming language runtime library (jemalloc is
simply a FreeBSD-ized stand-alone version of that allocator). I've
tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris. On Linux I've
compared against ptmalloc2, dlmalloc, and hoard. Unfortunately, that
version of the allocator has to make some performance sacrifices
regarding locking, since it doesn't have access to the pthreads
internals. As such, it requires a lot of care to interpret
performance results, so I haven't felt it to be worthwhile providing
that version here.
Post by Johan Bucht
* Focus - Single thread vs multithread
What is the focus of the allocator, and how much memory is it worth
wasting to implement a good scalable implementation? This is the
reason I think Solaris still keeps a serial allocator in libc and
makes libumem accessible through LD_PRELOAD and friends.
jemalloc only creates two arenas for single-threaded applications
(one for internal memory management, and one for user requests), so
it doesn't really sacrifice much in the case of single-threaded
programs -- a few pages maybe. In the case of multi-threaded
programs, it potentially sacrifices memory compactness in order to
reduce contention and cache line sharing.
Post by Johan Bucht
* Comment
Might be wise to fix the comment about always using mmap since brk can
be used.
Thanks, done.

Jason
Johan Bucht
2005-12-12 03:16:04 UTC
Permalink
Post by Jason Evans
Johan,
Thank you for your code review! Specific responses follow.
Wouldn't exactly call scrolling through a diff at 3 AM a review =).
Post by Jason Evans
Post by Johan Bucht
Just a few comments, not sure if all of them are valid since I don't
have a recent enough system to apply the patch and test it, so my
comments are just based on reading the patch. I might have missed
something hidden between all the lines differing. Can you put up a
patched malloc.c somewhere?
http://www.canonware.com/~jasone/jemalloc/malloc.c
Thanks!
Post by Jason Evans
Post by Johan Bucht
* Fork safety functions
Nice to have for all allocators and is something I missed having. Would
like to have them regardless if your malloc becomes standard or not.
I think that the implementation is currently fork-safe. The threads
libraries call _malloc_prefork() and _malloc_postfork() in order to
avoid locking issues.
Hmm, I meant that the _malloc_prefork() functions are independent of
your malloc implementation, and I would like them committed regardless,
as they make life easier for other malloc implementations.
Post by Jason Evans
Post by Johan Bucht
* magazines, per-arena caches
If I didn't miss something you don't seem to have it, pardon if I missed
them in my quick readings. They can really help the scalability and is
something that is implemented in Solaris libumem allocator with great
success. I implemented them and they helped bring the scalability up
quite a bit on the system I tested my allocator on aswell as bringing
the allocation latency down.
Arenas do use caches for objects below a size threshold (currently
~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).
The caching mechanism is quite different from magazines, but still
very effective.
Ah nice, shows how little review I did. =)
Will look a bit more at this then.
Post by Jason Evans
Post by Johan Bucht
* Saw you used a tree to look up the allocation sizes at free; this
differs from how I and Solaris libumem do it. Instead I went for
prepending an 8/16 byte tag containing the size of the allocation for
lockless size lookup.
Actually, my implementation prepends a 4 byte tag that contains
allocated/free flags, and the size of the allocation. The trees are
only used to look up the sizes of huge allocations (> 8 MB on 32-bit
platforms, and > 256 MB on 64-bit platforms). Huge allocations start
at chunk boundaries, so prepending a tag would require an entire
extra page (and would cause potentially serious external
fragmentation on 32-bit systems).
Isn't 8 byte alignment expected by some applications?
How do you know if an allocation is huge if you don't have a tag?
Something more to read up on, I guess. =)
Post by Jason Evans
Post by Johan Bucht
* Bucket sizes
If I didn't miss something you are using power of two sizes. Might be
better to use smaller sizes to avoid internal fragmentation, as well as
improving locking granularity.
My implementation uses quantum-sized buckets (not power of two size
classes). The quantum size is either 8 or 16 bytes, depending on the
machine architecture.
Ahh sorry, I just saw the pow stuff and assumed the buckets were
power-of-two sized.
Post by Jason Evans
Post by Johan Bucht
* Locking granularity
You use a single lock for the chunk allocation instead of a lock per
bucket size or something similar. This means that you limit access
unless threads allocate from their private arenas. Since you hash to
get an arena there might be multiple threads accessing the same arena
at the same time, introducing contention. Might be better to have a
lock per allocation size to somewhat improve the situation.
Using a lock per allocation size requires that slower algorithms be
used in some places (and it certainly complicates certain other
operations). This didn't seem like a worthwhile tradeoff to me. By
making the code faster and simpler, less time is spent in the malloc
code. Unless an app does nothing but malloc/free, I don't expect
this to be an issue. Even microbenchmarks that do nothing but malloc/
free don't appear to suffer.
Yeah, not sure how much of an impact it has on your allocator; it's
just something that improved mine, as I move magazines between local
arenas and the global one.
Post by Jason Evans
Post by Johan Bucht
* Locking primitive
The biggest issue and as David Xu pointed out is probably the locking
primitives. The SPINLOCK use has a limit in the threading library and
makes it hard to have a lot of mutexes. I ended up using a wrapper
around the umtx_lock function to get recursive mutexes and it would
probably be better to extend the umtx functions to handle recursion.
This would probably also be appreciated by other malloc
implementations.
Might be interesting to implement some of the ideas from the Linux futex
implementation to help umtx.
I have been contemplating creating a separate spinlock API that
doesn't require the threads library to track the spinlocks across
fork. This would (if I understand correctly) remove the current
static spinlock limitations.
As for supporting recursive spinlocks, I doubt that the overhead
would be acceptable in general. If I could get rid of the need for
the one recursive lock in malloc.c, I certainly would. =)
Yeah, I made some attempts to remove my recursive locks in the global
arena, but it made the code more complicated and didn't really do much.
Post by Jason Evans
Post by Johan Bucht
* Big allocations
It might be preferable to use mmap/munmap directly for big allocations
(bigger than a few pages) to avoid the overhead of trying to be smart.
These big allocations should be pretty few and shouldn't be that
pessimized by this. It also means some memory savings as you can
directly free memory from big allocations.
I don't really see the approach I'm taking as "trying to be smart",
so much as "making it simple by using the same algorithm for almost
everything". Many other malloc implementations I've examined use
three or more allocation approaches, depending on the allocation
size. jemalloc uses only two, and huge allocations really are huge,
such that they rarely if ever happen. In practice, I don't think
there is a real memory savings to be had. Besides, madvise() calls
in strategic locations would achieve approximately the same result.
Yeah, saving them for future use through madvise might be better. I
chose the route of keeping big allocations simple and avoiding caching
and lookups for them.
Basically: read the size and, if it's big, munmap it directly.
Post by Jason Evans
Post by Johan Bucht
* Ownership
I didn't look very hard for this, so I might have missed it. Basically,
what happens if you free memory that was allocated in another arena: do
you return it to the arena it was allocated from, or to the current
thread's arena? Not returning it to the original arena leads to cache
line sharing. Returning it to the original adds some overhead, and it
might be argued that applications doing this are not scalable anyway
and should be fixed instead.
Allocated objects can be associated with their parent arenas in
constant time (by masking bits in the object addresses). This makes
it possible to always free memory back to the same arena it came from.
Ok, good.
Post by Jason Evans
Post by Johan Bucht
* Standalone version
It would be nice to have a standalone version, both to test the
allocator on Solaris and Linux against their allocators, and for all
the people who only have access to SMP machines running another OS. I
tested my allocator on both Solaris 9 and FreeBSD 5/6, and it was
helpful both in testing scalability and impact on small machines as
well as in finding bugs in the implementation. I tested my allocator
against the Solaris libc, Hoard and libumem on a 4-CPU Solaris 9
machine (4x UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university,
and tested it against the FreeBSD libc on my 400MHz laptop with too
much beer and an almost non-working CPU fan. Beer as in standing in
10cm of beer in my backpack =).
Not sure how much of the implementation is FreeBSD-dependent (the
RB-trees?).
I actually have a separate more portable implementation of jemalloc
that is part of a programming language runtime library (jemalloc is
simply a FreeBSD-ized stand-alone version of that allocator). I've
tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris. On Linux I've
compared against ptmalloc2, dlmalloc, and hoard. Unfortunately, that
version of the allocator has to make some performance sacrifices
regarding locking, since it doesn't have access to the pthreads
internals. As such, it requires a lot of care to interpret
performance results, so I haven't felt it to be worthwhile providing
that version here.
ok
Post by Jason Evans
Post by Johan Bucht
* Focus - Single thread vs multithread
What is the focus of the allocator, and how much memory is it worth
wasting to implement a good scalable implementation? This is the
reason I think Solaris still keeps a serial allocator in libc and
makes libumem accessible through LD_PRELOAD and friends.
jemalloc only creates two arenas for single-threaded applications
(one for internal memory management, and one for user requests), so
it doesn't really sacrifice much in the case of single-threaded
programs -- a few pages maybe. In the case of multi-threaded
programs, it potentially sacrifices memory compactness in order to
reduce contention and cache line sharing.
As I have not tested it, I'll take your word for it. I mostly wanted to
know where you stand on the issue of pessimizing single-threaded apps
vs MT, and whether it's worth keeping phkmalloc for some apps.
Post by Jason Evans
Post by Johan Bucht
* Comment
Might be wise to fix the comment about always using mmap since brk can
be used.
Thanks, done.
Jason
No problem, always fun to have some code that you actually understand
the idea behind and see how it's implemented.

/Johan Bucht
Johan Bucht
2005-12-12 03:22:41 UTC
Permalink
Post by Johan Bucht
Post by Jason Evans
Post by Johan Bucht
* Fork safety functions
Nice to have for all allocators and is something I missed having. Would
like to have them regardless if your malloc becomes standard or not.
I think that the implementation is currently fork-safe. The threads
libraries call _malloc_prefork() and _malloc_postfork() in order to
avoid locking issues.
Hmm, I meant that the _malloc_prefork() functions are independent from
your malloc allocation and I would like them committed regardless as
they make the life easier for other malloc implementation.
Bah, the libc bits not the actual functions of course. =)
Should probably go to sleep instead of writing semi-understandable
sentences.

/Johan Bucht
Jason Evans
2005-12-12 03:27:39 UTC
Permalink
Post by Johan Bucht
Post by Jason Evans
Post by Johan Bucht
* Saw you used a tree to look up the allocation sizes at free; this
differs from how I and Solaris libumem do it. Instead I went for
prepending an 8/16 byte tag containing the size of the allocation for
lockless size lookup.
Actually, my implementation prepends a 4 byte tag that contains
allocated/free flags, and the size of the allocation. The trees
are only used to look up the sizes of huge allocations (> 8 MB on
32-bit platforms, and > 256 MB on 64-bit platforms). Huge
allocations start at chunk boundaries, so prepending a tag would
require an entire extra page (and would cause potentially serious
external fragmentation on 32-bit systems).
Isn't 8 byte alignment expected by some applications?
Yes, 8 or 16 byte alignment is expected (in fact I'm of the opinion
that we should be using 16 byte alignment on i386 due to SSE2/SSE3).
If I understand your question correctly, you're asking how I get away
with 4 byte tags. Consider that (assuming 8-byte quantum) a malloc
(16) call must actually be serviced by at least 24 bytes internally,
in order to pad to the next quantum size. If I used 8 byte tags,
then malloc(17) would require 32 bytes internally. By using 4 byte
tags, malloc(13)..malloc(20) can be serviced by 24 bytes internally.
At least one implementation (the OS X malloc implementation) uses 2
byte tags, but this has two problems: 1) unaligned memory access is
slow on some architectures, and 2) it's too small to handle large
object sizes, so a different allocation strategy has to be used
starting at ~12 KB.
Post by Johan Bucht
How do you know if an allocation is huge if you don't have a tag?
I know an allocation is huge if its base address is chunk-aligned.
The actual size is stored in a red-black tree node, which is
allocated separately.

Jason
Johan Bucht
2005-12-12 03:48:40 UTC
Permalink
Post by Jason Evans
Post by Johan Bucht
Isn't 8 byte alignment expected by some applications?
Yes, 8 or 16 byte alignment is expected (in fact I'm of the opinion
that we should be using 16 byte alignment on i386 due to SSE2/SSE3).
If I understand your question correctly, you're asking how I get away
with 4 byte tags. Consider that (assuming 8-byte quantum) a malloc
(16) call must actually be serviced by at least 24 bytes internally,
in order to pad to the next quantum size. If I used 8 byte tags,
then malloc(17) would require 32 bytes internally. By using 4 byte
tags, malloc(13)..malloc(20) can be serviced by 24 bytes internally.
At least one implementation (the OS X malloc implementation) uses 2
byte tags, but this has two problems: 1) unaligned memory access is
slow on some architectures, and 2) it's too small to handle large
object sizes, so a different allocation strategy has to be used
starting at ~12 KB.
Well, I just wondered how you avoided unaligned accesses with a 4 byte tag.
Post by Jason Evans
Post by Johan Bucht
How do you know if a allocation is huge if you don't have a tag?
I know an allocation is huge if its base address is chunk-aligned.
The actual size is stored in a red-black tree node, which is
allocated separately.
Ok, I expected it was through the address; thanks for answering anyway.
Gonna take some time reading through the code before asking more stupid
questions. =)

/Johan Bucht
Doug Rabson
2005-12-12 10:17:09 UTC
Permalink
Post by Johan Bucht
Isn't 8 byte alignment expected by some applications?
How do you know if an allocation is huge if you don't have a tag?
Something more to read up on i guess. =)
Actually, I'm glad you pointed that out. My own applications use SSE2
primitives a lot and those guys need to be allocated on 16-byte
boundaries. I currently use phkmalloc which has the nice property that
small allocations are aligned to the next greater power-of-2 boundary
which is (for me) always at least 16.

Since you are targeting modern architectures with this allocator, could
you please set the minimum alignment on all i386 and amd64 processors
to 16. For what it's worth, when gcc is using SSE instructions, it
assumes this for all memory, stack or malloc.
Doug Barton
2005-12-12 02:58:40 UTC
Permalink
Post by Jason Evans
It would be very informative to run benchmarks with real world
multithreaded apps. bind9 (built with threading support) would be a
great candidate,
Maybe someday, but not at the moment. I've been told by the folks at ISC
that in its current form, threading is a pessimization on all versions of
BIND 9 prior to the 9.4 code that they have in CVS (which has not been
generally released). Thus, I would not expect performance of BIND 9 with
threads to be an accurate reflection of the effectiveness of the library.

hth,

Doug
--
This .signature sanitized for your protection
Daniel Eischen
2005-12-12 05:58:25 UTC
Permalink
Post by Jason Evans
Post by Johan Bucht
* Locking primitive
The biggest issue and as David Xu pointed out is probably the locking
primitives. The SPINLOCK use has a limit in the threading library and
makes it hard to have a lot of mutexes. I ended up using a wrapper
around the umtx_lock function to get recursive mutexes and it would
probably be better to extend the umtx functions to handle recursion.
This would probably also be appreciated by other malloc
implementations.
Might be interesting to implement some of the ideas from the Linux
futex
implementation to help umtx.
I have been contemplating creating a separate spinlock API that
doesn't require the threads library to track the spinlocks across
fork. This would (if I understand correctly) remove the current
static spinlock limitations.
What about using pthread_atfork()?
Post by Jason Evans
As for supporting recursive spinlocks, I doubt that the overhead
would be acceptable in general. If I could get rid of the need for
the one recursive lock in malloc.c, I certainly would. =)
Why do we need a recursive mutex? Can you not restructure the
code so that it is not needed?
--
DE
Jason Evans
2005-12-12 06:28:30 UTC
Permalink
Post by Daniel Eischen
Post by Jason Evans
I have been contemplating creating a separate spinlock API that
doesn't require the threads library to track the spinlocks across
fork. This would (if I understand correctly) remove the current
static spinlock limitations.
What about using pthread_atfork()?
Aren't there potential ordering issues for that? It seems to me that
the malloc pre-fork code would need to be run after any other pre-fork
functions, in order to avoid potential deadlock, and that the malloc
post-fork code would need to be run before any other post-fork
functions, again to avoid potential deadlock.

After looking at the spinlock code some more, it's no longer clear to
me why the thread/thr_spinlock.c code uses a static array for
spinlocks. It seems to me that it would work fine to allow the
program to provide space for a spinlock and manually initialize it.
This would remove the limitation on the number of spinlocks.
Post by Daniel Eischen
Post by Jason Evans
As for supporting recursive spinlocks, I doubt that the overhead
would be acceptable in general. If I could get rid of the need for
the one recursive lock in malloc.c, I certainly would. =)
Why do we need a recursive mutex? Can you not restructure the
code so that it is not needed?
There is an internal arena that the malloc code uses for allocating
internal data structures. In some cases, the internal arena has to
recursively allocate. If there were no object caching, it might be
possible to pre-allocate, such that recursion never happens, but
given the object caching, it's difficult to reason about precisely
what will occur internally for a single malloc/free operation. There
are some other possibilities, but nothing I've thought of so far is
simple or elegant.

Fixing this would make all locking in the malloc code a bit cheaper,
which is why it continues to bother me.

Jason
Daniel Eischen
2005-12-12 15:40:22 UTC
Permalink
Post by Jason Evans
Post by Daniel Eischen
Post by Jason Evans
I have been contemplating creating a separate spinlock API that
doesn't require the threads library to track the spinlocks across
fork. This would (if I understand correctly) remove the current
static spinlock limitations.
What about using pthread_atfork()?
Aren't there potential ordering issues for that? It seems to me that
the malloc pre-fork code would need to be run after any other pre-fork
functions, in order to avoid potential deadlock, and that the malloc
post-fork code would need to be run before any other post-fork
functions, again to avoid potential deadlock.
After looking at the spinlock code some more, it's no longer clear to
me why the thread/thr_spinlock.c code uses a static array for
spinlocks. It seems to me that it would work fine to allow the
program to provide space for a spinlock and manually initialize it.
This would remove the limitation on the number of spinlocks.
We really want to deprecate the use of spinlocks in libc, so it
is just a band-aid until we change our mutex types to be full
structs instead of pointers to them that have to get allocated.
When that happens, the malloc implementation should be changed
to use mutexes since they will not have to be allocated, and, for
the uncontested case, be just an instruction or two (very much
like umtx, possibly the same).

When the imminent symbol versioning hits the tree, it will
be much easier to make the mutex change since it will not
affect everything in the world that uses mutexes, CVs, etc.
We just need to decide on a layout for them so all 3 thread
libraries agree, at least on size and on what FOO_INITIALIZERs
do.
Post by Jason Evans
Post by Daniel Eischen
Post by Jason Evans
As for supporting recursive spinlocks, I doubt that the overhead
would be acceptable in general. If I could get rid of the need for
the one recursive lock in malloc.c, I certainly would. =)
Why do we need a recursive mutex? Can you not restructure the
code so that it is not needed?
There is an internal arena that the malloc code uses for allocating
internal data structures. In some cases, the internal arena has to
recursively allocate. If there were no object caching, it might be
possible to pre-allocate, such that recursion never happens, but
given the object caching, it's difficult to reason about precisely
what will occur internally for a single malloc/free operation. There
are some other possibilities, but nothing I've thought of so far is
simple or elegant.
Well, just lock around the external functions and remove all locking
from the internal and recursive functions. Can't all recursion
be replaced with loops anyways ;-)
Post by Jason Evans
Fixing this would make all locking in the malloc code a bit cheaper,
which is why it continues to bother me.
--
DE
Johan Bucht
2005-12-12 16:20:29 UTC
Permalink
Post by Daniel Eischen
Post by Jason Evans
Post by Daniel Eischen
Post by Jason Evans
As for supporting recursive spinlocks, I doubt that the overhead
would be acceptable in general. If I could get rid of the need for
the one recursive lock in malloc.c, I certainly would. =)
Why do we need a recursive mutex? Can you not restructure the
code so that it is not needed?
There is an internal arena that the malloc code uses for allocating
internal data structures. In some cases, the internal arena has to
recursively allocate. If there were no object caching, it might be
possible to pre-allocate, such that recursion never happens, but
given the object caching, it's difficult to reason about precisely
what will occur internally for a single malloc/free operation. There
are some other possibilities, but nothing I've thought of so far is
simple or elegant.
Well, just lock around the external functions and remove all locking
from the internal and recursive functions. Can't all recursion
be replaced with loops anyways ;-)
A simple description of the issues follows.

First ideal case, no recursive locking needed:

1. Lock the local arena to allocate memory
2. If we need to allocate internal data structures lock the base arena
2.1. Allocate memory from the base arena
2.2. Unlock base arena
3. Unlock local arena

The ideal case assumes that the base arena has everything it needs and
doesn't have to set up internal data structures to handle the
allocations.
Actual case:

0. From malloc(3) and friends, get a local arena and pass it into the
internal functions
1. Lock the supplied arena
2. If we need to allocate internal data structures, go to step 1, using
the base arena instead of one of the local arenas
2.1 Allocate memory from base arena
2.2 Unlock base arena
3. Unlock arena

This means that we can lock the base arena recursively, should the
allocation of memory from the base arena itself need to allocate
memory. Solving this would need some special handling for the base
arena (avoiding the common arena interface, or statically allocating
the base arena).

/Johan Bucht
