Venus: parse errors on many feeds

Sam Ruby rubys at intertwingly.net
Thu Jun 25 09:09:48 EST 2009


Mary Gardiner wrote:
> I am getting parse errors on many feeds in a way that suggests Feed Parser is
> failing. For an example, see the .ini file at
> https://users.puzzling.org/users/mary/venus-test/test.ini which pulls in the
> feed at https://users.puzzling.org/users/mary/venus-test/rss.xml (this feed is
> originally at http://blog.gingertech.net/feed/ )

Is this still an issue?

Can you try the following:

python tests/reconstitute.py \
   https://users.puzzling.org/users/mary/venus-test/rss.xml

- Sam Ruby

> Note that if I use feedparser directly, it has no problem with the file:
> 
> $ pwd
> /home/mary/src/venus/trunk/planet/vendor
> $ python
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) 
> [GCC 4.3.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import feedparser
> >>> feedparser.__file__
> 'feedparser.pyc'
> >>> feedparser.parse('https://users.puzzling.org/users/mary/venus-test/rss.xml')
> {'feed': {'lastbuilddate': u'Sun, 14 Jun 2009 14:15:54 +0000', 'subtitle': u"Silvia's blog" ...
> 
> However, if I run /home/mary/src/venus/trunk/planet.py test.ini, I get
> HTMLParseError emerging from within feedparser:
> 
> $ /home/mary/src/venus/trunk/planet.py test.ini
> /home/mary/src/venus/trunk/planet/reconstitute.py:16: DeprecationWarning: the md5 module is deprecated; use hashlib instead
>   import re, time, md5, sgmllib
> ERROR:planet.runner:Error processing https://users.puzzling.org/users/mary/venus-test/rss.xml
> ERROR:planet.runner:HTMLParseError: malformed start tag, at line 4, column 55
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/spider.py", line 437, in spiderPlanet
>     data = feedparser.parse(feed, **options)
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 3525, in parse
>     feedparser.feed(data)
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1662, in feed
>     sgmllib.SGMLParser.feed(self, data)
> ERROR:planet.runner:  File "/usr/lib/python2.6/sgmllib.py", line 104, in feed
>     self.goahead(0)
> ERROR:planet.runner:  File "/usr/lib/python2.6/sgmllib.py", line 143, in goahead
>     k = self.parse_endtag(i)
> ERROR:planet.runner:  File "/usr/lib/python2.6/sgmllib.py", line 320, in parse_endtag
>     self.finish_endtag(tag)
> ERROR:planet.runner:  File "/usr/lib/python2.6/sgmllib.py", line 360, in finish_endtag
>     self.unknown_endtag(tag)
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 569, in unknown_endtag
>     method()
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1512, in _end_content
>     value = self.popContent('content')
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 849, in popContent
>     value = self.pop(tag)
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 764, in pop
>     mfresults = _parseMicroformats(output, self.baseuri, self.encoding)
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 2218, in _parseMicroformats
>     p = _MicroformatsParser(htmlSource, baseURI, encoding)
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1823, in __init__
>     self.document = BeautifulSoup.BeautifulSoup(data)
> ERROR:planet.runner:  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1499, in __init__
>     BeautifulStoneSoup.__init__(self, *args, **kwargs)
> ERROR:planet.runner:  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1230, in __init__
>     self._feed(isHTML=isHTML)
> ERROR:planet.runner:  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1263, in _feed
>     self.builder.feed(markup)
> ERROR:planet.runner:  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
>     self.goahead(0)
> ERROR:planet.runner:  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
>     k = self.parse_starttag(i)
> ERROR:planet.runner:  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
>     endpos = self.check_for_whole_start_tag(i)
> ERROR:planet.runner:  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
>     self.error("malformed start tag")
> ERROR:planet.runner:  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
>     raise HTMLParseError(message, self.getpos())
> 
> System details:
>  - Ubuntu 9.04
>  - Python 2.6.2 (Ubuntu package 2.6.2-0ubuntu1)
>  - Venus trunk revno 113, which seems to be the latest
> 
> -Mary
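
The quoted traceback is long, so it may help to reduce it to the chain of files it passes through. What follows is only a minimal sketch in modern Python, assuming the log keeps the `ERROR:planet.runner:  File "...", line N, in func` layout shown in the quoted output. The names `FRAME_RE`, `frames_from_log` and `modules_in_order` are hypothetical helpers for illustration and are not part of Venus or feedparser.

```python
import re

# Matches frame lines in the layout seen in the quoted log, for example:
# > ERROR:planet.runner:  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
FRAME_RE = re.compile(
    r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\w+)'
)


def frames_from_log(text):
    """Return (path, line number, function) for each traceback frame in text."""
    return [
        (m.group("path"), int(m.group("line")), m.group("func"))
        for m in FRAME_RE.finditer(text)
    ]


def modules_in_order(frames):
    """Return file base names in first-seen order, dropping repeats."""
    seen = []
    for path, _line, _func in frames:
        name = path.rsplit("/", 1)[-1]
        if name not in seen:
            seen.append(name)
    return seen


# A few frame lines copied from the quoted traceback.
sample = '''\
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/spider.py", line 437, in spiderPlanet
> ERROR:planet.runner:  File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 3525, in parse
> ERROR:planet.runner:  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1499, in __init__
> ERROR:planet.runner:  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
'''

frames = frames_from_log(sample)
print(modules_in_order(frames))  # call chain, outermost file first
print(frames[-1])                # innermost frame, where the error is raised
```

On these sample lines the innermost frame is in HTMLParser.py and is reached through BeautifulSoup.py, which matches the traceback: the exception is raised while feedparser's `_parseMicroformats` step hands the content to BeautifulSoup.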


