SV: Re: venus -n option

Antonio Eggberg antonio_eggberg at yahoo.se
Tue Mar 27 03:24:15 EST 2007


--- Morten Høybye Frederiksen <morten at wasab.dk> wrote:

> Hi,
> 
> On 3/25/07, Sam Ruby <rubys at intertwingly.net> wrote:
> > My general advice is to treat the cache as the resource, not the xml.
> > If, for example, you had a program which built a small database of file
> > names and hashes of the content for each, subsequent runs would be able
> > to tell which files are new and which files had changed.
> FWIW: For my WordPress plugin [1] I simply record the modification
> time and size of each file in the cache, and on subsequent runs only
> reparse the files that are changed. Since I don't check the content, I
> might miss out on some updates, but it's a lot faster not having to
> load each entry.

OK. But it only works with WordPress :-) or did I miss something? This could
solve my problem. Is it available as open source? For me, parsing takes about
half of the time.
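
To make the idea concrete, here is a minimal sketch in Python of the
modification-time-and-size check described in the quoted text. It is not taken
from wp-venus or Venus: the function name changed_files, the JSON index file,
and keeping that index outside the cache directory are all assumptions made
for illustration.

```python
import json
import os


def changed_files(cache_dir, index_path):
    """Return names of files in cache_dir that are new or whose
    (mtime, size) signature differs from the previous run.

    index_path is a hypothetical JSON file that remembers the signatures;
    it should live outside cache_dir so it is not scanned itself.
    """
    try:
        with open(index_path) as f:
            previous = json.load(f)
    except (FileNotFoundError, ValueError):
        previous = {}  # first run, or unreadable index: treat all as new

    current = {}
    changed = []
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        st = os.stat(path)
        signature = [st.st_mtime_ns, st.st_size]
        current[name] = signature
        if previous.get(name) != signature:
            changed.append(name)  # only these need to be reparsed

    # Remember the signatures for the next run.
    with open(index_path, "w") as f:
        json.dump(current, f)
    return changed
```

As noted in the quoted text, a check like this never reads the content, so an
edit that preserves both size and modification time would be missed. Storing a
content hash instead would catch it, at the cost of reading every file.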

On a different issue, I wonder whether any thought has been given to
adaptive crawling of feeds. In my use case, approximately 70% of the
blogs are not updated even once a day, whereas I have some regular
daily news sites that are updated by the hour. So I don't want to
visit that 70% of the sites every time. My thought is this: imagine
you crawl every hour, so..

hour 1 : crawl a feed --> no update
hour 2 : crawl the same feed again --> no update
push the next crawl back by +2 hours (a config option)
hour 5 : crawl again --> no update
push the crawl another +2 hours (so it will be crawled 4 hours from now)

The same goes for feeds that get updated often, but in the opposite
direction. You could also have a general check option --> yes, which would
check all feeds, but that could be user-activated, say once a month.
I think this would minimize network load.
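
A rough sketch of the schedule above, in Python, might look like the
following. This is only one reading of the idea, not anything Venus does
today: the class name FeedSchedule, the constants, the two-misses rule before
backing off, and the 24-hour cap are all assumptions chosen to match the
example (crawls at hours 1, 2 and 5).

```python
from dataclasses import dataclass

BASE_INTERVAL = 1          # hours between crawls of an active feed (assumed)
BACKOFF_STEP = 2           # the "+2 hours" config option from the example
MISSES_BEFORE_BACKOFF = 2  # unchanged crawls tolerated before backing off (assumed)
MAX_INTERVAL = 24          # never wait longer than this many hours (assumed)


@dataclass
class FeedSchedule:
    interval: int = BASE_INTERVAL  # current gap between crawls, in hours
    misses: int = 0                # consecutive crawls that found no update

    def record(self, now, updated):
        """Record the result of a crawl at hour `now`; return the next crawl hour."""
        if updated:
            # Opposite direction: a feed that updates is crawled sooner again.
            self.misses = 0
            self.interval = max(BASE_INTERVAL, self.interval - BACKOFF_STEP)
        else:
            self.misses += 1
            if self.misses >= MISSES_BEFORE_BACKOFF:
                self.interval = min(self.interval + BACKOFF_STEP, MAX_INTERVAL)
        return now + self.interval
```

With these assumed values, a feed with no updates is crawled at hours 1, 2 and
5, and after the third miss the accumulated delay is 4 hours on top of the
base interval. A manual "check all feeds" run would simply ignore the stored
interval.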

Is there anything like that? The activity_threshold option is a bit blurry to
me. Are those feeds never crawled again?

Just some thoughts..

> 
> [1] http://www.wasab.dk/morten/blog/archives/2006/10/22/wp-venus
> 
> 
> Regards,
> Morten
> 



	
	
		

