Using regexp_filter (venus)
rubys at intertwingly.net
Thu Mar 15 00:59:03 EST 2007
Amit Chakradeo (अमित चक्रदेव) wrote:
> On 3/13/07, Sam Ruby <rubys at intertwingly.net> wrote:
> Amit Chakradeo (अमित चक्रदेव) wrote:
> > Hi,
> > I am trying to filter out some items which have the string
> > GolfNow.com.
> > I tried the following two ways, but neither worked (I still
> > see articles containing the GolfNow.com string!)
> > filters= regexp_sifter.py?exclude=GolfNow\.com
> > filters= regexp_sifter.py?--exclude=GolfNow\.com
> > But if I pass the options on the command line, it seems to work:
> > cat cache_item_file | python regexp_sifter.py --exclude GolfNow\.com
> > (no output which is good)
> > ???
> I ran some tests, and the code seems to be working.
> Filters are applied when spidering, adding a filter later won't remove
> anything from the cache.
> Could that be the point of confusion? If so, that can be fixed for most
> cases.... spider could be modified to actively delete cache entries
> which have been filtered. This will only remove entries which are
> actually present in a feed, and only for feeds that actually change.
> Yes, that was the point of confusion. Thanks. Actually I just looked at
> the architecture picture and it is very clear. For now I just deleted
> the offending entries from cache and will check if it gets filtered. I
> was just running splice.py with the hope that the filters will be applied.
> Maybe the filters should have some configuration setting telling when to
> apply them... Though I can't imagine a use-case in which this would be
I improved the documentation to make it clearer: see first note on the
filters page. I also added the code (along with a test) to ensure that
filters, when run, will actively clean out old entries in the cache.
While that doesn't cover all the cases, it might help...
- Sam Ruby
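
The behavior described above can be illustrated with a minimal sketch. This is not Venus's actual code: the `sift` and `spider` functions and the dict-based cache are hypothetical stand-ins, assuming only what the thread states, namely that filters run at spider time and that a filtered entry is removed from the cache only if it is still present in the feed.

```python
import re

def sift(entry_text, exclude):
    """Return entry_text unchanged, or '' if it matches the exclude pattern."""
    if re.search(exclude, entry_text):
        return ""
    return entry_text

def spider(feed_entries, cache, exclude):
    """Apply the filter to each entry currently in the feed.

    feed_entries and cache are dicts mapping entry id -> entry text
    (a hypothetical stand-in for the on-disk cache).
    """
    for entry_id, text in feed_entries.items():
        if sift(text, exclude):
            cache[entry_id] = text       # entry passed the filter: (re)cache it
        else:
            cache.pop(entry_id, None)    # entry was filtered: drop any cached copy
    return cache

cache = {
    "a": "Tee times from GolfNow.com",   # cached before the filter was added
    "old": "GolfNow.com special offer",  # no longer present in the feed
}
feed = {
    "a": "Tee times from GolfNow.com",
    "b": "Planet Venus release notes",
}
spider(feed, cache, r"GolfNow\.com")
print(sorted(cache))  # ['b', 'old']
```

Entry "a" is cleaned out because it is still in the feed and now matches the filter. Entry "old" survives because the spider never sees it again, which is one of the cases the change does not cover.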