IFX> PDF/is Issue.

Wed Mar 12 10:48:46 EST 2003

Of course, if you had to read in reverse order starting from the end of 
the file, that would have performance implications, too.  It would be 
impossible for the reader to start generating output until the writer had 
finished generating and transmitting the document.

        -Carl

"Poysa, Kari" <Kari.Poysa at usa.xerox.com>
03/12/2003 07:04 AM

        To:     "Hastings, Tom N" <hastings at cp10.es.xerox.com>, "'Rick Seeler'" 
<rseeler at adobe.com>, Carl Kugler/Boulder/IBM at IBMUS
        cc:     ifx at pwg.org
        Subject:        RE: IFX> PDF/is Issue.

Tom, the Length being discussed here is actually the byte count of the 
streams of the Image XObjects that belong to the Page.  So if the Page is 
composed of more than one image (a.k.a. banding), then the sender does not 
need to cache even a full page's worth of compressed data in order to be 
able to write the Image XObject's stream length in the stream dictionary.
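
A minimal sketch of the banding point above, assuming the page raster is 
available as a list of byte rows and that each band is compressed with 
zlib (Flate).  The function name and the rows_per_band parameter are 
hypothetical, not part of PDF/is.  The sender only ever holds one band's 
compressed data, yet knows each stream's length before sending it:

```python
import zlib

def band_streams(rows, rows_per_band):
    """Yield (length, compressed_bytes) for each band of raster rows.

    Only one band is compressed and held in memory at a time, so the
    length of each band's stream is known before that stream is sent.
    """
    for i in range(0, len(rows), rows_per_band):
        band = b"".join(rows[i:i + rows_per_band])
        compressed = zlib.compress(band)
        yield len(compressed), compressed
```

Smaller bands reduce the sender's buffer at the cost of more Image 
XObjects per page.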

Full PDF allows the writer to enter an indirect object reference into the 
required Length entry. This makes it easy to implement writers because the 
separate object for the length can be written after all of the image data 
has been written. The PDF files are then read in the reverse order 
starting from the end of the file. This works well if one has a file 
system to store the complete PDF file.  So requiring the Length to be a 
direct value in the stream dictionary most likely would cause existing 
writer SW to have to be modified.  One could not keep writing the same 
kind of files and claim them PDF/is compliant.
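
The two placements of Length can be sketched as follows.  This is only an 
illustration: the function names are made up, the dictionaries carry 
nothing but /Length, and the object numbers are arbitrary.

```python
def write_stream_direct(out, obj_num, data):
    """Direct Length: the whole stream must be in hand before writing."""
    out.write(b"%d 0 obj\n<< /Length %d >>\nstream\n" % (obj_num, len(data)))
    out.write(data)
    out.write(b"\nendstream\nendobj\n")

def write_stream_indirect(out, obj_num, len_obj_num, chunks):
    """Indirect Length: the length object is written after the data."""
    out.write(b"%d 0 obj\n<< /Length %d 0 R >>\nstream\n"
              % (obj_num, len_obj_num))
    total = 0
    for chunk in chunks:          # chunks can be written as they are produced
        out.write(chunk)
        total += len(chunk)
    out.write(b"\nendstream\nendobj\n")
    # The separate object holding the length follows the stream.
    out.write(b"%d 0 obj\n%d\nendobj\n" % (len_obj_num, total))
    return total
```

The second form never buffers the stream, which is why existing writers 
favor it; the first is what a direct-Length requirement would demand.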

    --- Kari ---
-----Original Message-----
From: Hastings, Tom N 
Sent: Tuesday, March 11, 2003 5:49 PM
To: Poysa, Kari; 'Rick Seeler'; 'Carl Kugler'
Cc: ifx at pwg.org
Subject: RE: IFX> PDF/is Issue.

Kari,

I think you summed up the tradeoff between the Sender and the Receiver 
simply when you said:

"If we require the reader to be able to cache a page's worth of 
uncompressed data, surely we can require the writer to cache a page's 
worth of compressed data [in order to determine the length and send that 
length in the stream]."

I assume that PDF has the notion of a length for each page, right?  So we 
require that the Sender put in a length field for each page of data at the 
front of each page of data.  Can that length field be sent with the data 
in some manner, so that the Sender doesn't have to know the lengths of all 
of the pages before sending any?

Tom
-----Original Message-----
From: Poysa, Kari [mailto:Kari.Poysa at usa.xerox.com]
Sent: Friday, March 07, 2003 15:04
To: 'Rick Seeler'; 'Carl Kugler'
Cc: ifx at pwg.org
Subject: RE: IFX> PDF/is Issue.

Rick, I bet this solution can be implemented, but it does have some 
problems for the reader that unfortunately I did not see earlier. The 
difficulty really is whether we want to make life easy for the streaming 
writer or the reader. 

If the length follows the image stream, the reader must scan the filtered 
stream to find the end of the stream. This can make the reader 
implementation both cumbersome and slow, especially if the stream has to 
be fully decoded during the PDF file parsing, instead of simply extracting 
the correct amount of binary data and passing it to a separate 
decompression module. The PDF file parser would have to know details of 
the compressed streams, which should really be of no interest to the 
parser module, and this makes it harder to build applications from 
third-party components.
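
For contrast, a rough sketch of the reader when the length precedes the 
stream.  It assumes a direct /Length in a simple dictionary and a single 
"\n" after the "stream" keyword; the function name is hypothetical.  The 
parser copies out exactly Length bytes and never inspects them:

```python
import io
import re

# A direct integer Length, followed by the next key or the end of the dict.
DIRECT_LENGTH = re.compile(rb"/Length\s+(\d+)\s*(?:/|>>)")

def read_stream_with_length(f):
    """Return the raw stream bytes, using only the direct /Length value."""
    header = b""
    while not header.endswith(b"stream\n"):
        byte = f.read(1)
        if not byte:
            raise ValueError("input ended before the 'stream' keyword")
        header += byte
    m = DIRECT_LENGTH.search(header)
    if m is None:
        raise ValueError("no direct /Length in the stream dictionary")
    length = int(m.group(1))
    data = f.read(length)         # opaque bytes, handed on to the decoder
    if len(data) != length:
        raise ValueError("stream shorter than its declared Length")
    return data
```

The returned bytes can be passed unchanged to a separate decompression 
module, even if they happen to contain the text "endstream".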

In addition, if the reader attempts to decode the stream, how much data 
should be cached and decoded at a time? If the end of stream is not found 
at first attempt, one has to pass additional data to the decoder and 
continue decoding from where previous data ended. This can delay achieving 
robust implementations. The alternative, searching for the "endstream" 
text, is not 100% reliable (although very close) and is a wasted step 
since no decompression is achieved yet.

This issue is really at the heart of what "streamable" means, and also has 
a big impact on what kind of low resource applications PDF/is can be used 
for. I think we should consider it a "MUST" for the writer to prefix the 
stream with its length, since the goal is to make the file format 
streamable especially at a low resource reader. If we require the reader 
to be able to cache a page's worth of uncompressed data, surely we can 
require the writer to cache a page's worth of compressed data.

I do understand Ira McDonald's note about streaming writers (see separate 
email). Perhaps whether image streams are prefixed or postfixed with their 
lengths should be a negotiable capability between the sender and 
receiver?

    --- Kari ---
-----Original Message-----
From: Rick Seeler [mailto:rseeler at adobe.com]
Sent: Thursday, March 06, 2003 2:37 PM
To: 'Poysa, Kari'; 'Carl Kugler'
Cc: ifx at pwg.org
Subject: RE: IFX> PDF/is Issue.

Kari,

Yes, the stream length should precede the stream, if possible (this is 
allowed).  But, in the case where the stream may be long, this may not be 
possible for the Producer.  In that case, the length should be an indirect 
object reference to the length that should come immediately after the 
stream.

As for your idea of scanning for an "endstream" that is followed by the 
size object: this has the same problem as scanning for "endstream" alone, 
just with more data to match and a smaller likelihood of a false match.

Given that, and what I discussed in my previous e-mail on this subject (to 
Rob Buckley), I think the best approach might be the following:
1) The Producer MUST always write the stream length of all 'Content 
Streams' and 'ICC Profile' streams immediately in the object dictionary 
(before the stream).
2) When writing image streams, the Producer MAY either write the stream 
length before or after the stream, as they prefer.
3) When an image stream's length follows the stream (as an indirect 
object), the Consumer SHOULD decode the image stream to determine its 
length, when possible.  But the Consumer MAY (at its peril) scan for the 
'endstream' marker.

How does this sound as a solution?

-Rick
-----Original Message-----
From: owner-ifx at pwg.org [mailto:owner-ifx at pwg.org] On Behalf Of Poysa, Kari
Sent: Thursday, March 06, 2003 7:15 AM
To: 'Carl Kugler'
Cc: ifx at pwg.org
Subject: RE: IFX> PDF/is Issue.

In my opinion the goal should be to write the stream length immediately to 
the stream dictionary.

Also, the likelihood of "endstream" occurring in the data is small.  We 
could also require that if a low resource streaming writer is not able to 
add the length directly into the stream dictionary, then the PDF object 
for the length MUST immediately follow the stream object. This way, the 
reader can scan for "endstream" (but of course only if the length was not 
in the stream dictionary) and make sure that it is the correct "endstream" 
by verifying that it is immediately followed by something that looks like 
a length object. Could reader implementers comment on this?
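
A rough sketch of that check, assuming "\n" line endings and a length 
object written right after the stream object's "endobj".  The function 
name is hypothetical.  Comparing the declared length with the number of 
bytes scanned is an extra step added in this sketch, beyond "looks like a 
length object":

```python
import re

# "endstream", end of the stream object, then an object holding one integer.
LENGTH_OBJ = re.compile(rb"\nendstream\nendobj\n\d+ 0 obj\n(\d+)\nendobj\n")

def find_stream_end(buf, start):
    """Return the offset where the stream data ends, or None.

    start is the offset of the first data byte after the 'stream' keyword.
    """
    pos = start
    while True:
        m = LENGTH_OBJ.search(buf, pos)
        if m is None:
            return None
        # Accept the candidate only if its length matches what was scanned.
        if int(m.group(1)) == m.start() - start:
            return m.start()
        pos = m.start() + 1       # false match inside the data, keep going
```

A stream could still be constructed to defeat this, so it lowers the 
chance of a wrong match rather than removing it.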

I think introducing an additional filter like ASCII85 just for spotting 
the end of stream adds unnecessary complexity to both writer and reader, 
increases file sizes and also requires more memory and processing as the 
stream cannot be passed directly to a decompressor.

    --- Kari ---
-----Original Message-----
From: Carl Kugler [mailto:kugler at us.ibm.com]
Sent: Wednesday, March 05, 2003 10:50 AM
Cc: ifx at pwg.org
Subject: RE: IFX> PDF/is Issue.

I like the chunking approach.  It is efficient, reliable, and has low 
overhead for reasonably sized chunks.  Also fits well in a typical 
implementation that writes a chunk of data at a time. 

        -Carl 

"Zehler, Peter" <PZehler at crt.xerox.com> 
Sent by: owner-ifx at pwg.org 
03/05/2003 05:00 AM 

        To:        "'Rick Seeler'" <rseeler at adobe.com>, ifx at pwg.org 
        cc:         
        Subject:        RE: IFX> PDF/is Issue. 

Rick, 
Why not just increase the size of the length field signature?  Could this 
be done by the addition of data or comments in the length object or by 
adding another object?  I don't know PDF very well.  I don't think we need 
0% probability of confusion, just a statistically insignificant chance.
Pete 

Peter Zehler 
XEROX 
Xerox Architecture Center 
Email: PZehler at crt.xerox.com 
Voice:    (585) 265-8755 
FAX:      (585) 265-8871 
US Mail: Peter Zehler 
        Xerox Corp. 
       800 Phillips Rd. 
       M/S 128-30E 
       Webster NY, 14580-9701 
-----Original Message-----
From: Rick Seeler [mailto:rseeler at adobe.com]
Sent: Tuesday, March 04, 2003 1:29 PM
To: ifx at pwg.org
Subject: IFX> PDF/is Issue.

During prototyping of PDF/is the following problem arose: 

How does the Consumer know when the end of a data stream (See section 
3.2.7 of [pdf]) is reached?  Normally, in a PDF, the Consumer would 
consult the stream length field.  The problem here is where to put the 
length field.  If the length were placed before the stream, the Consumer 
would know how long the stream is. This requires the Producer to know the 
stream's length before writing it to the Consumer.  If, instead, the 
length were written at the end of the stream, this would solve the 
Producer's problem but the Consumer would not know how to find the length 
since they can't identify, 100% of the time, where the stream ends and 
where the length object is. 

An example will illustrate: 
First, the normal case... 

stream 
sdljfiwefnwfubrevurewliysnhr;hgawebfz;h;uwre (lots of binary data here)....
84trhdvfyu7wgf4.nbdrgur4uaru4gb 
endstream 
12 0 obj 
3456    <- the length of the previous stream. 
endobj 

But, what if the data looked like this... 

stream 
sdljfiwefnwfubrevurewliysnhr;hgawebfz;h;uwre (lots of binary data here)....
endstream   <- the binary data could have a string of bytes that looked like this.
84trhdvfyu7wgf4.nbdrgur4uaru4gb 
endstream 
12 0 obj 
4567    <- the length of the previous stream. 
endobj 

Of course, you could look to bytes after the appearance of the word 
'endstream' to see if this is really the end of the stream; but you can 
always come up with a stream that could match your parsing algorithm's 
expectations (although with decreasing percentage of occurrence). 
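
The failure can be reproduced with a few lines.  This is a sketch only: 
the function name is hypothetical and the bytes are stand-ins for real 
image data.

```python
MARKER = b"\r\nendstream"

def naive_stream_end(buf, start):
    """Return the offset of the first marker at or after start (-1 if none)."""
    return buf.find(MARKER, start)

# Stream data that happens to contain the marker bytes.
payload = b"binary-data" + MARKER + b"\r\nmore-binary-data"
buf = (b"stream\r\n" + payload + MARKER
       + b"\r\n12 0 obj\r\n%d\r\nendobj\r\n" % len(payload))

start = len(b"stream\r\n")
cut = naive_stream_end(buf, start)
truncated = buf[start:cut]        # stops at the marker inside the data
```

Here the scan stops at the embedded marker, so the Consumer would hand 
only the first part of the data to the decoder.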

Possible solutions: 
1) Write all data using ASCII85 encoding (See Section 3.3.2 of [pdf]). 
This will increase stream lengths by 25%.  ASCII85 has a stream delimiter 
which would solve this problem -- the end of the stream can be known for 
certain and the length field can be placed after the stream. 
2) Require the Producer to write the stream length before any stream (the 
streams would stay binary).  The Producer can use banding to break up 
large images into small enough chunks so the Producer can cache the stream 
before sending. 
3) Offer a combination of 1 & 2.  The Producer would cache streams if 
possible, but may use ASCII85, if necessary. 
4) The Producer must make certain that no stream contains the byte 
sequence "\0D\0Aendstream" in its data.  This is how the spec is 
defined currently -- but this may be too onerous for the Producer. 
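
For solution 1, a small sketch using Python's standard base64 module, 
which implements the same ASCII85 alphabet.  The helper names are 
hypothetical, and "~>" is appended by hand as the end-of-data marker:

```python
import base64

EOD = b"~>"  # ASCII85 end-of-data marker

def ascii85_stream(data: bytes) -> bytes:
    """Encode data as ASCII85 and append the end-of-data marker."""
    return base64.a85encode(data) + EOD

def read_ascii85_stream(buf: bytes, start: int = 0) -> bytes:
    """Decode from start up to the first end-of-data marker."""
    end = buf.index(EOD, start)   # "~" never occurs in the encoded data
    return base64.a85decode(buf[start:end])
```

Every 4 input bytes become 5 output characters (all-zero groups are 
abbreviated to a single "z"), which is where the 25% growth comes from. 
The encoded alphabet never contains "~", so the Consumer can find the end 
of the stream without knowing its length.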

Any other ideas?  I'm personally leaning toward solution #3. 

-Rick 
