New venus showing only random selection of posts [runtests.py errors]
rubys at intertwingly.net
Sat May 22 04:14:48 EST 2010
On 05/21/2010 01:20 AM, Mikael Nilsson wrote:
> Spontaneously, the 304 is suspicious:
> it could very well explain why it works the first time and not the second.
> Grepping through the sources it does seem *all shown sources* do NOT
> have the 304 line, while all not shown have it. Bingo!
Just to give an update. What 304 means is that the content hasn't
changed. This will often be the case if you run Venus again immediately
after you have just run it.
Not all servers return this status; some simply return the same content
again, wasting their bandwidth and yours, and requiring Venus to
reprocess the same content.
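As a rough illustration of how a client can handle a 304, here is a minimal sketch. It is not the code Venus uses; the function names `conditional_headers` and `resolve_body` are made up for this example. It only assumes the standard HTTP validators (`If-None-Match`, `If-Modified-Since`).

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional GET from cached validators.

    Hypothetical helper for illustration; not part of Venus.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def resolve_body(status, body, cached_body):
    """Return (content_to_process, changed) for a given HTTP status."""
    if status == 304:
        # Not Modified: the server sent no body, so reuse the cached copy.
        return cached_body, False
    if status == 200:
        # Full response: it may still be identical to what is cached,
        # which is the case for servers that never answer with 304.
        return body, body != cached_body
    raise ValueError("unhandled status: %r" % status)
```

With this shape, a second run immediately after the first would normally take the 304 branch and fall back to whatever is already in the cache.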
Clearly the link information is being lost somewhere in your runs; you
can see it in the debug messages below. And apparently, in the case of
an unchanged feed, it gets written to the cache.
This information is important. It is later matched up with the data in
the cache. If, while processing the page, Venus encounters data in the
cache that doesn't match any source, it presumes that you have
unsubscribed from that feed, and removes it from the list.
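A simplified sketch of that matching step may help show why a lost link makes a feed disappear. This is not the actual Venus code: the entry layout, the `self_link` key, and the function name `match_cache_to_sources` are all assumptions made for illustration.

```python
def match_cache_to_sources(cached_entries, subscriptions):
    """Split cached entries into those matching a subscription and the rest.

    cached_entries: list of dicts, each with a hypothetical "self_link" key
                    (None when the link information was lost).
    subscriptions:  iterable of subscribed feed URLs.
    """
    subscribed = set(subscriptions)
    kept, dropped = [], []
    for entry in cached_entries:
        if entry.get("self_link") in subscribed:
            kept.append(entry)
        else:
            # No matching source: treated the same as an unsubscribed feed.
            dropped.append(entry)
    return kept, dropped
```

Under these assumptions, an entry whose link was lost is indistinguishable from one belonging to a feed you have unsubscribed from, so both end up in `dropped`.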
While I do have an understanding that is consistent with what you are
seeing, what I have failed to do so far is reproduce your results. I've
tried repeatedly with the exact feeds you mention, and I'm not seeing
the problem, but I am continuing to pursue it.
Meanwhile, mbrubeck fixed a bug in link checking that would cause links
to be lost:
I've pushed that fix out to my git repository for Venus. If you could
try it and report back, I would appreciate it.
- Sam Ruby
> tor 2010-05-20 klockan 18:19 -0400 skrev Sam Ruby:
>> On 05/20/2010 05:39 PM, Mikael Nilsson wrote:
>>> Right, so with your latest commit the "links" error is gone, replaced by
>>> *lots* of debug prints like
>>> DEBUG:planet.runner:missing self link for http://magnihasa.blogspot.com/feeds/posts/default/-/Piratrelaterat?alt=rss
>>> DEBUG:planet.runner:missing html link for http://magnihasa.blogspot.com/feeds/posts/default/-/Piratrelaterat?alt=rss
>>> DEBUG:planet.runner:missing self link for http://christianengstrom.wordpress.com/feed/
>>> DEBUG:planet.runner:missing html link for http://christianengstrom.wordpress.com/feed/
>>> DEBUG:planet.runner:missing self link for http://ershag.se/feed/
>>> DEBUG:planet.runner:missing html link for http://ershag.se/feed/
>>> which seem to coincide with the sources not being shown.
>>> The corresponding feeds do exist on disk, even the individual posts are
>>> generated in the cache directory. Something is wrong in the reloading of
>>> the posts.
>> Inside the cache directory, there is a subdirectory named sources.
>> There should be one file there per subscription. Can you visually
>> compare one of the ones that is skipped against one that matches a
>> source that is being shown to see if there is any obvious difference?
>> Feel free to post (or send me directly) one of each (these files are
>> typically only a kilo-byte or two).
>> Also, in the output is there any output that starts with the word
>> - Sam Ruby