Failed parsing of an ATOM feed

Sam Ruby rubys at intertwingly.net
Tue Jun 3 08:22:37 EST 2008


On Mon, Jun 2, 2008 at 3:47 PM, Vasil Kolev <vasil at ludost.net> wrote:
> This is what I get from this feed:
>
> INFO:planet.runner:Updating feed http://debian.fmi.uni-sofia.bg/~ogi/blog/feeds/atom10.xml
> ERROR:planet.runner:Error processing http://debian.fmi.uni-sofia.bg/~ogi/blog/feeds/atom10.xml
> ERROR:planet.runner:UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)
> ERROR:planet.runner:  File "/home/vasil/www/pesho/venus/planet/spider.py", line 468, in spiderPlanet
>    writeCache(uri, feed_info, data)
> ERROR:planet.runner:  File "/home/vasil/www/pesho/venus/planet/spider.py", line 214, in writeCache
>    output = xdoc.toxml().encode('utf-8')
> ERROR:planet.runner:  File "xml/dom/minidom.py", line 47, in toxml
> ERROR:planet.runner:  File "xml/dom/minidom.py", line 62, in toprettyxml
> ERROR:planet.runner:  File "StringIO.py", line 271, in getvalue
>    self.buf += ''.join(self.buflist)
>
> Looking at it, there doesn't seem to be any problem with it (especially
> at position 0), and I can't find 0xd1 anywhere that it should
> matter. Any ideas? I updated to the latest snapshot and still see this.

This probably doesn't help much, but the feed is not well formed.
When a feed is not well formed, the feed parser falls back on a set of
heuristics to salvage what data it can.  Those heuristics may not be
as good at extracting meaning from Cyrillic text as from Latin text.
I can try to debug further, but IMHO a fix upstream would also be
worth pursuing.
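
For what it's worth, 0xd1 is the lead byte that UTF-8 uses for Cyrillic
letters in the range U+0440 to U+047F, so the error is at least
consistent with a UTF-8 byte string of Cyrillic text being implicitly
decoded as ASCII somewhere along the way.  The following is only a
sketch in Python 3 (the traceback above is from Python 2, where the
decode happens implicitly when byte strings and unicode strings are
joined); the helper name try_ascii_decode is made up for illustration
and is not part of Venus:

```python
# Illustrative sketch only: reproduces the error text, not the Venus code path.
def try_ascii_decode(raw: bytes) -> str:
    """Return 'ok' if raw is pure ASCII, else the decode error message."""
    try:
        raw.decode("ascii")
        return "ok"
    except UnicodeDecodeError as exc:
        return str(exc)

# CYRILLIC SMALL LETTER ES (U+0441) encodes to the two bytes D1 81 in UTF-8.
cyrillic = "\u0441".encode("utf-8")
message = try_ascii_decode(cyrillic)

print(cyrillic)
print(message)
```

The reported "position 0" would then refer to the start of whichever
byte string was being decoded, not to the start of the feed.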

http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fdebian.fmi.uni-sofia.bg%2F%7Eogi%2Fblog%2Ffeeds%2Fatom10.xml#l129
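
A quick local check for well-formedness, without going through the
validator, might look like the sketch below.  It uses only the Python
standard library; the function name and the two sample documents are
invented for illustration and are not taken from the feed in question:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def well_formedness_error(data: bytes):
    """Return the ExpatError if data is not well-formed XML, else None."""
    try:
        xml.dom.minidom.parseString(data)
    except ExpatError as exc:
        return exc
    return None

# Hypothetical samples: one well-formed document, one with a mismatched tag.
good = b'<feed xmlns="http://www.w3.org/2005/Atom"><title>ok</title></feed>'
bad = b'<feed><title>broken</feed>'

for sample in (good, bad):
    err = well_formedness_error(sample)
    if err is None:
        print("well formed")
    else:
        # ExpatError carries the line number and column offset of the problem.
        print("not well formed at line %d, column %d" % (err.lineno, err.offset))
```

To run it against a real feed, pass in the raw bytes of the downloaded
document rather than the samples.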

- Sam Ruby

