P1394> Revised PWG1394 Cmd Set

Fri Aug 7 06:14:31 EDT 1998

Greg,

I think reading the AGENT_STATE register (perhaps periodically) is very
similar to a heart-beat, isn't it?

Because the error occurs (accidentally) during communication between
only two nodes, I think a bus reset sacrifices too much of the entire
bus to be the first recovery step.
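
To make the ordering concrete, here is a rough sketch of the escalation I
have in mind: poll AGENT_STATE only after a period of target inactivity,
recover locally if the Fetch Engine is DEAD, and fall back to a bus reset
only when the target cannot be reached at all. The function name, the
returned action strings, the read_agent_state callback, and the retry limit
are all made up for illustration, and the state values should be checked
against SBP-2 before anyone relies on them.

```python
from enum import IntEnum


class AgentState(IntEnum):
    # Assumed encoding of the AGENT_STATE state field; verify against SBP-2.
    RESET = 0
    ACTIVE = 1
    SUSPENDED = 2
    DEAD = 3


def next_recovery_step(idle_seconds, timeout_seconds, read_agent_state,
                       failed_polls=0, max_failed_polls=3):
    """Pick the least disruptive recovery action (hypothetical helper).

    read_agent_state is a caller-supplied function that performs the
    register read and raises OSError if the transaction fails.
    """
    # Target has been active recently: nothing to do yet.
    if idle_seconds < timeout_seconds:
        return "wait"
    try:
        state = read_agent_state()
    except OSError:
        # Only when the target cannot be read repeatedly do we disturb
        # the whole bus.
        if failed_polls + 1 >= max_failed_polls:
            return "bus_reset"
        return "retry_poll"
    if state == AgentState.DEAD:
        # Affects only these two nodes: rebuild the list of pending ORBs
        # and queue it again.
        return "requeue_pending_orbs"
    # Agent is alive; the target is simply slow.
    return "wait"
```

For example, next_recovery_step(6.0, 5.0, lambda: AgentState.DEAD) yields
"requeue_pending_orbs", so the other nodes on the bus never notice.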

Am I missing something?

Akihiro Shimura

On Thu, 6 Aug 1998 09:45:09 -0700 (PDT)
Greg Shue <gregs at sdd.hp.com> wrote:

> 
> Fumio Nagasaka wrote:
> > > My idea is that a print spooler in the initiator shall detect the timeout.
> > 
> > OK, forget my brain damaged idea. But still we don’t have cool answers.
> > Any session layer beyond SBP-2 can detect this error. Then what shall we do?
> > Examining ORB_POINTER requires the same effort from an initiator as
> > issuing a transport command. Doesn’t it?
> 
> I wouldn't call it "brain damaged", just focused on a narrow
> application space.  :-)
> 
> Issuing a transport command is much more intrusive to the system
> than examining the AGENT_STATE and ORB_POINTER registers, because
> it now requires its own sequence number, and it can't be put on
> the queue until the condition is corrected!
> 
> So, just for review:
> 
>   - Any identified errors other than a lost Ack on a status FIFO write
>     either have a retry algorithm described in SBP-2, or cause
>     a status FIFO write to report the error.
> 
>   - All identified errors cause the Fetch Engine to transition to
>     the DEAD state, and cause the entire task set to be cleared.
> 
>   - Because we are dealing with parallel-execution rather than
>     just out-of-order, we must clarify what value ends up in the
>     ORB_POINTER register when the Fetch Engine is in the DEAD state.
> 
>   ? The process used by an initiator to recover when it finds
>     the Fetch Engine in the DEAD state without a cause written to
>     the status FIFO is to construct a linked list of the pending
>     ORBs and queue it up on the Task List.
> 
>   - Because the PWG Transport Requirements identify the transport as
>     "reliable" and "Connection-oriented" and the undocumented transport
>     model is TCP, the following options come to mind:
> 
>       1) The stall detection could be left up to the Transport
>          Client.  Any clients which do not provide this detection
>          (via a timeout) would get hung in this condition until
>          something caused a Bus Reset.
> 
>       2) The stall detection could be handled by the initiating
>          node's PWG protocol layer.  It could either be done by:
>            - a timeout (negotiated reconnect interval?) on target
>              activity, causing the initiator to look at the
>              AGENT_STATE register.
>            - the target's node generating a Bus Reset
>            - ???
> 
> I believe strongly that this detection should NOT be forced on
> the Transport Client or above.  I also believe that human
> intervention must not be required to recover from this condition.
> Any target inactivity detection must occur within a
> target-dependent timeout, because different functions and
> technologies have different constraints.  (e.g.  ink jet dry
> times, scanner bulb color stability, media sitting over fusers.)
> 
> Can somebody model or better yet measure the frequency of this error?
> If it really is extremely low, then let's just let the target
> node generate a Bus Reset.  Otherwise let's put in a heart-beat
> write to the Status FIFO using Unsolicited Status.
> 
> I think we need to choose one at the next meeting and be done
> with it.  We're only looking for a solution for the first
> revision of this protocol.
> 
> -- 
> Greg Shue
> Hewlett-Packard Company
> All-in-One Division			        gregs at sdd.hp.com
> ----------------------------------------------------------------
> 

--
 Akihiro Shimura (shimura at pure.cpdc.canon.co.jp)
 Office Imaging System Promotion Project
 CANON INC.