Parallelizing Fetches
Joe Gregorio
joe at bitworking.org
Sat Nov 4 04:26:34 EST 2006
On 11/3/06, Harry Fuecks <hfuecks at gmail.com> wrote:
> > Joe Gregorio is looking into parallelizing the HTTP fetches; taking a
> > single feed and processing it multiple times is also an obvious
> > optimization. But for now, yes, that would mean multiple fetches.
>
> I guess Joe is already a long way into this but perhaps this is
> interesting anyway;
>
> http://blog.bitflux.ch/archive/2005/10/28/how-to-fetch-a-lot-of-feeds.html
Here is the branch I have been working on:
http://bitworking.org/projects/venus/branches/threaded/
This branch includes httplib2 to handle the fetching. I have added a
new config option, 'spider_threads', that sets the number of threads
to use when spidering. The default is 0. When spider_threads is 0,
httplib2 is not used and feedparser fetches the feeds. Note that the
threading only applies to HTTP(S) URIs; all other URI types are
fetched in the main thread by feedparser. All parsing is also done
only in the main thread.
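
To illustrate the split described above (fetching in worker threads,
parsing in the main thread), here is a minimal sketch. It is not the
code from the branch: fetch_all(), spider(), and the fetch and parse
callables are hypothetical names, and fetch stands in for whatever
wraps httplib2.

```python
import queue
import threading


def fetch_all(uris, fetch, num_threads=10):
    """Fetch every URI and return {uri: (content, error)}.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that takes a URI and returns its content.
    """
    work = queue.Queue()
    done = queue.Queue()
    for uri in uris:
        work.put(uri)

    def worker():
        while True:
            try:
                uri = work.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker exits
            try:
                done.put((uri, fetch(uri), None))
            except Exception as exc:  # record the failure and keep going
                done.put((uri, None, exc))

    if num_threads <= 0:
        worker()  # no threading: drain the queue in the calling thread
    else:
        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    results = {}
    while not done.empty():
        uri, content, error = done.get()
        results[uri] = (content, error)
    return results


def spider(uris, fetch, parse, num_threads=10):
    """Fetch with worker threads, then parse only in the calling thread."""
    fetched = fetch_all(uris, fetch, num_threads)
    parsed = {}
    for uri in uris:
        content, error = fetched[uri]
        if error is None:
            parsed[uri] = parse(content)
    return parsed
```

Keeping the parse step out of the workers means only the fetch
function has to be safe to call from several threads.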
httplib2's caching is used; the cache is stored as 'http' under the
sources cache directory.
The status of the code is 'under testing'. All of the current unit tests
pass and I have run it successfully over my configurations but it
definitely needs more testing, and it needs more unit tests.
Some preliminary numbers for ~60 feeds:
config.spider_threads = 0     1m40s
config.spider_threads = 10    30s
I also ran a special test in which I removed line 375 of
planet/spider.py, where the call to spiderFeed() is skipped if the
response from httplib2 comes from the cache. With that line removed,
forcing a call to spiderFeed() for each and every feed, the timing is:
config.spider_threads = 0     1m40s
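
The cache check described above can be sketched roughly as follows.
This is not the actual code at line 375 of planet/spider.py. The
function spider_if_changed(), its skip_cached flag, and the
spider_feed callable with its (uri, content) signature are
hypothetical; the only thing taken from httplib2 is the 'fromcache'
attribute on its response object.

```python
def spider_if_changed(uri, response, content, spider_feed, skip_cached=True):
    """Call spider_feed(uri, content) unless the response came from the cache.

    Returns True if spider_feed was called, False if it was skipped.
    `spider_feed` is a hypothetical stand-in for the real processing step.
    """
    # httplib2 responses carry a boolean `fromcache` attribute; getattr
    # with a default keeps the sketch usable with other response objects.
    if skip_cached and getattr(response, "fromcache", False):
        return False
    spider_feed(uri, content)
    return True
```

Passing skip_cached=False corresponds to the special test, where every
feed is processed whether or not it came from the cache.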
Of course, this is just the beginning for multi-threaded spidering. What
should be added after this is stable is some code to make the spidering
'nice'. That is, the code should look up IP addresses and avoid hitting
the same server more than once every N seconds.
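
As a rough idea of what 'nice' could look like, here is a minimal
single-threaded sketch. Nothing like this exists in the branch yet;
the HostThrottle class and its parameters are hypothetical. By default
it keys on the host name, and a resolver such as socket.gethostbyname
could be passed in to key on IP addresses instead. A threaded version
would also need locking around the bookkeeping.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same server."""

    def __init__(self, min_interval, resolve=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # `resolve` maps a host name to the key used for throttling;
        # the default leaves the host name unchanged.
        self.resolve = resolve or (lambda host: host)
        self.clock = clock
        self.sleep = sleep
        self.last_hit = {}  # throttling key -> time of the last request

    def wait(self, uri):
        """Block until it is polite to fetch `uri`; return the key used."""
        key = self.resolve(urlsplit(uri).hostname)
        now = self.clock()
        last = self.last_hit.get(key)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_hit[key] = now
        return key
```

A fetch loop would call throttle.wait(uri) just before each request,
with min_interval playing the role of N.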
Thanks,
-joe
--
Joe Gregorio http://bitworking.org