parallel test failures

David Bremner-2

parallel test failures


I have intermittent failures when running the test suite on sufficiently
parallel machines.  I have attached a log of such a failing build,
although it does not seem especially illuminating.

It takes anywhere from 5 to 300 runs to get a failure for me running on
60 hardware threads (30 cores). At least on this machine the number of
tests that pass seems consistent at 1205.


_______________________________________________
notmuch mailing list -- [hidden email]
To unsubscribe send an email to [hidden email]

Attachment: log.xz (24K)
Xu Wang

Re: parallel test failures

I did not look at the logs, but I have had problems in other scenarios.
The way I debugged them was to use strace to get a list of all files the
tests accessed. From that list I could see that some files that should
have been in separate temp directories were not thread-specific; the
solution was to put the temp files in a separate directory for each
test. Not sure if this is helpful, but I wanted to share.
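
A minimal sketch of that approach, in case it helps. It assumes the trace
was captured with strace's -o option, so each line looks roughly like
`1234 openat(AT_FDCWD, "/path", O_RDONLY) = 3`. The function name and the
test script name in the comment are only illustrative, and only the first
quoted argument of each call is extracted (so the second path of a
rename, for example, is ignored).

```shell
#!/bin/sh
# Capture a trace first, for example (the script name is a placeholder):
#   strace -f -e trace=%file -o trace.log ./some-test.sh

# Print every distinct path that appears as the first quoted argument
# of a traced system call in the given strace log.
paths_from_strace_log() {
    # keep only lines containing a quoted string; emit the first one
    sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}
```

Running `paths_from_strace_log trace.log` once per test gives one path
list per test, which can then be compared between tests.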

Kind regards and best of luck,

Xu

On Fri, Feb 19, 2021 at 7:24 AM David Bremner <[hidden email]> wrote:

>
>
> I have intermittent failures when running the test suite on sufficiently
> parallel machines.  I have attached a log of such a failing build,
> although it does not seem especially illuminating.
>
> It takes anywhere from 5 to 300 runs to get a failure for me running on
> 60 hardware threads (30 cores). At least on this machine the number of
> tests that pass seems consistent at 1205

Tomi Ollila-2

Re: parallel test failures

In reply to this post by David Bremner-2
On Fri, Feb 19 2021, David Bremner wrote:

> I have intermittent failures when running the test suite on sufficiently
> parallel machines.  I have attached a log of such a failing build,
> although it does not seem especially illuminating.
>
> It takes anywhere from 5 to 300 runs to get a failure for me running on
> 60 hardware threads (30 cores). At least on this machine the number of
> tests that pass seems consistent at 1205

I made the following changes to see file write accesses:

----
diff --git a/test/notmuch-test b/test/notmuch-test
index b58fd3b3..903a5dff 100755
--- a/test/notmuch-test
+++ b/test/notmuch-test
@@ -62,13 +62,16 @@ if test -z "$NOTMUCH_TEST_SERIALIZE" && command -v parallel >/dev/null ; then
         META_FAILURE="parallel test suite returned error code $RES"
     fi
 else
+    rm -rf inw; mkdir inw
     for test in $TESTS; do
+        testname=$(basename $test .sh)
+        inotifywait -d --outfile $PWD/inw/inw-$testname -r -e close_write,delete $PWD/test /tmp
         $TEST_TIMEOUT_CMD $test "$@" &
         wait $!
+        pkill inotifywa
         # If the test failed without producing results, then it aborted,
         # so we should abort, too.
         RES=$?
-        testname=$(basename $test .sh)
         if [[ $RES != 0 && ! -e "$NOTMUCH_BUILDDIR/test/test-results/$testname" ]]; then
             META_FAILURE="Aborting on $testname (returned $RES)"
             break
----

Then ran tests w/ NOTMUCH_TEST_SERIALIZE=t

and then ran

for f in inw/*; do echo $f; sed -e 's,.*notmuch/test/,  ,' -e '/tmp.T/ s,/.*,,' $f | sort -u; echo; done | less

to examine "fallout"

Based on that (random glances at the listing), I did not see any
potentially overlapping writes, but I did see an unrelated inconsistency
in the test directories.
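
Rather than eyeballing the listing, the overlap check could be done
mechanically. This is only a sketch: it assumes each per-test file has
already been reduced to one written path per line (the raw inotifywait
output would need post-processing first), and the function name is made
up for illustration.

```shell
#!/bin/sh
# Print every path that occurs in more than one of the given per-test lists.
overlapping_paths() {
    # de-duplicate within each list, so a path repeated by a single test
    # does not count as an overlap
    for f in "$@"; do
        sort -u "$f"
    done | sort | uniq -d
}
```

With such lists in a directory, `overlapping_paths lists/*` would print
only the paths written by two or more tests.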

Anyway, the log.gz did not show any tests failing but parallel exiting
nonzero possibly for some other reason. Cannot say. Probably stracing (even
with --seccomp-bpf) would make it happen even less likely :/

Tomi

David Bremner-2

Re: parallel test failures

Tomi Ollila <[hidden email]> writes:

>
> Anyway, the log.gz did not show any tests failing but parallel exiting
> nonzero possibly for some other reason. Cannot say. Probably stracing (even
> with --seccomp-bpf) would make it happen even less likely :/
>

Thanks to both of you for your feedback / suggestions. I did read today
that timeout exits with 124 when the time limit is reached. I haven't
investigated further, nor do I know how the time limit could be reached,
since the whole build+test cycle takes about 10s on this machine.
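
For reference, the 124 status is easy to reproduce in isolation. The
sketch below assumes timeout(1) from GNU coreutils; the sleep command
simply stands in for a hung test.

```shell
#!/bin/sh
# A command that outlives its limit: timeout kills it and exits with 124.
status=0
timeout 1 sleep 5 || status=$?
echo "timed out, exit status: $status"

# A command that finishes in time: its own exit status is passed through.
ok_status=0
timeout 5 true || ok_status=$?
echo "finished, exit status: $ok_status"
```

So a 124 from the test runner means the time limit expired, not that an
individual test reported a failure.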

d

David Bremner-2

Re: parallel test failures

David Bremner <[hidden email]> writes:

> Tomi Ollila <[hidden email]> writes:
>
>>
>> Anyway, the log.gz did not show any tests failing but parallel exiting
>> nonzero possibly for some other reason. Cannot say. Probably stracing (even
>> with --seccomp-bpf) would make it happen even less likely :/
>>
>
> Thanks to both of you for your feedback / suggestions. I did read today
> that timeout exits with 124 when the time limit is reached. I haven't
> investigated further, nor do I know how the time limit could be reached,
> since the whole build+test cycle takes about 10s on this machine.

Maybe a timeout is not so crazy. I ran a couple of trials with
NOTMUCH_TEST_TIMEOUT=0, and it eventually hung (after 6, and 110
repetitions) in T355-smime, as far as I can tell on the first test.
I'm currently running some trials to see if I can duplicate that without
parallel execution, but that of course takes longer.
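
The trials amount to a repeat-until-failure loop, roughly like the sketch
below. The function name is invented for illustration, and the command to
repeat (for example the test suite invocation) is whatever is passed in.

```shell
#!/bin/sh
# run_until_failure MAX CMD [ARGS...]
# Repeat CMD up to MAX times and stop at the first non-zero exit status.
run_until_failure() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "failed on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max runs"
    return 0
}
```

Note that with NOTMUCH_TEST_TIMEOUT=0 a hang never returns, so the loop
just stops making progress instead of reporting a failed run.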

d

Tomi Ollila-2

Re: parallel test failures

On Fri, Feb 26 2021, David Bremner wrote:

> David Bremner <[hidden email]> writes:
>
>>
>> Thanks to both of you for your feedback / suggestions. I did read today
>> that timeout exits with 124 when the time limit is reached. I haven't
>> investigated further, nor do I know how the time limit could be reached,
>> since the whole build+test cycle takes about 10s on this machine.
>
> Maybe a timeout is not so crazy. I ran a couple of trials with
> NOTMUCH_TEST_TIMEOUT=0, and it eventually hung (after 6, and 110
> repetitions) in T355-smime, as far as I can tell on the first test.
> I'm currently running some trials to see if I can duplicate that without
> parallel execution, but that of course takes longer.

So, AFAIU, you got 124 because timeout(1) exited with that status (and
killed all parallel(1) executions, after 2 minutes in that case?)...
... and when you set NOTMUCH_TEST_TIMEOUT=0, timeout(1) was not
executed and a test hung (probably T355-smime).

If you manage to get it into the hung state again (without timeout(1)
interfering), you can probably inspect things with ps, /proc, strace,
gdb, or some other (potentially more sophisticated ;) tools.
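
As a starting point, something like the following sketch could be pointed
at the PID of the hung process. It assumes a Linux /proc layout and falls
back to ps(1) elsewhere; the function name is made up for illustration.

```shell
#!/bin/sh
# inspect_pid PID: print basic state information about a running process.
inspect_pid() {
    pid=$1
    if [ -r "/proc/$pid/status" ]; then
        # Linux: name, parent, and scheduler state
        # (S = sleeping, D = uninterruptible wait, T = stopped)
        grep -E '^(Name|State|PPid):' "/proc/$pid/status"
        # open file descriptors; a process blocked on a pipe or socket
        # will show that pipe or socket here
        ls -l "/proc/$pid/fd" 2>/dev/null
    else
        # elsewhere: fall back to ps(1)
        ps -o pid,ppid,stat,args -p "$pid"
    fi
    # For a deeper look (not run here):
    #   strace -p "$pid"    # which system call it is blocked in
    #   gdb -p "$pid"       # attach and get a backtrace
}
```

If the hang really is in T355-smime, the open descriptors of the stuck
child may already hint at what it is waiting for.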

>
> d

Tomi

David Bremner-2

Re: parallel test failures

Tomi Ollila <[hidden email]> writes:

> So, AFAIU, you got 124 because timeout(1) exited with that status (and
> killed all parallel(1) executions, after 2 minutes in that case?)...
> ... and when you set NOTMUCH_TEST_TIMEOUT=0, timeout(1) was not
> executed and a test hung (probably T355-smime).

That sounds right.

> If you manage to get it into the hung state again (without timeout(1)
> interfering), you can probably inspect things with ps, /proc, strace,
> gdb, or some other (potentially more sophisticated ;) tools.

In fact it looks like I already reported this issue (or a different
issue causing T355 to hang, which seems less likely) at

   id:[hidden email]

Past me seems to have thought it was some kind of gpgsm failure. I would
welcome input from people who use or understand gpgsm.

d