Using regexp_filter (venus)

Sam Ruby rubys at intertwingly.net
Thu Mar 15 00:59:03 EST 2007


Amit Chakradeo (अमित चक्रदेव) wrote:
> 
> 
> On 3/13/07, Sam Ruby <rubys at intertwingly.net> wrote:
> 
>     Amit Chakradeo (अमित चक्रदेव) wrote:
>      > Hi,
>      >
>      >    I am trying to filter out some items which have the string
>      > GolfNow.com. 
> 
> 
> ...
> 
>      > I tried using the following ways, but both of them did not work
>      > (I still see articles containing the GolfNow.com string!)
>      > filters= regexp_sifter.py?exclude=GolfNow\.com
>      >
>      > filters= regexp_sifter.py?--exclude=GolfNow\.com
>      >
>      > But if I pass in the options on command line it seems to work:
>      > cat cache_item_file | python regexp_sifter.py  --exclude GolfNow\.com
>      > (no output which is good)
>      >
>      > ???
> 
>     I ran some tests, and the code seems to be working.
> 
>     Filters are applied when spidering; adding a filter later won't remove
>     anything from the cache.
> 
>     Could that be the point of confusion?  If so, that can be fixed for most
>     cases: the spider could be modified to actively delete cache entries
>     which have been filtered.  This will only remove entries which are
>     actually present in a feed, and only for feeds that actually change.
> 
> 
> Yes, that was the point of confusion. Thanks. Actually I just looked at
> the architecture picture and it is very clear. For now I just deleted
> the offending entries from the cache and will check whether they get
> filtered. I was just running splice.py in the hope that the filters
> would be applied.
> 
> Maybe the filters should have some configuration setting telling when to
> apply them...  Though I can't imagine a use case in which this would be
> useful...

I improved the documentation to make this clearer: see the first note on
the filters page.  I also added code (along with a test) to ensure that
filters, when run, will actively clean old entries out of the cache.
While that doesn't cover all cases, it might help...
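
For anyone following along, the behaviour described above can be
sketched roughly as follows.  This is an illustration only, not the
actual Venus code: the function names apply_filter and spider, and the
idea of passing a feed as a dict of cache file name to entry text, are
assumptions made up for this example.

```python
import os
import re


def apply_filter(entry_text, exclude):
    """Return the entry unchanged, or None if it matches the exclude pattern."""
    if re.search(exclude, entry_text):
        return None
    return entry_text


def spider(feed_entries, cache_dir, exclude):
    """Hypothetical spider pass over one feed.

    feed_entries maps a cache file name to the entry text (an assumption
    for this sketch; the real cache layout is not shown here).
    """
    for name, text in feed_entries.items():
        path = os.path.join(cache_dir, name)
        if apply_filter(text, exclude) is None:
            # Filtered out: also delete any copy cached by an earlier run.
            if os.path.exists(path):
                os.remove(path)
        else:
            # Kept: write (or refresh) the cached copy.
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)
```

Note that in this sketch a stale cache file is only removed if the
entry still appears in the feed being spidered, which is why it doesn't
cover every case.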

- Sam Ruby

