Venus: parse errors on many feeds
Sam Ruby
rubys at intertwingly.net
Thu Jun 25 09:09:48 EST 2009
Mary Gardiner wrote:
> I am getting parse errors on many feeds in a way that suggests Feed Parser is
> failing. For an example, see the .ini file at
> https://users.puzzling.org/users/mary/venus-test/test.ini which pulls in the
> feed at https://users.puzzling.org/users/mary/venus-test/rss.xml (this feed is
> originally at http://blog.gingertech.net/feed/ )
Is this still an issue?
Can you try the following:
python tests/reconstitute.py \
https://users.puzzling.org/users/mary/venus-test/rss.xml
- Sam Ruby
> Note that if I use feedparser directly, it has no problem with the file:
>
> $ pwd
> /home/mary/src/venus/trunk/planet/vendor
> $ python
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
> [GCC 4.3.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import feedparser
> >>> feedparser.__file__
> 'feedparser.pyc'
> >>> feedparser.parse('https://users.puzzling.org/users/mary/venus-test/rss.xml')
> {'feed': {'lastbuilddate': u'Sun, 14 Jun 2009 14:15:54 +0000', 'subtitle': u"Silvia's blog" ...
>
> However, if I run /home/mary/src/venus/trunk/planet.py test.ini, I get
> HTMLParseError emerging from within feedparser:
>
> $ /home/mary/src/venus/trunk/planet.py test.ini
> /home/mary/src/venus/trunk/planet/reconstitute.py:16: DeprecationWarning: the md5 module is deprecated; use hashlib instead
> import re, time, md5, sgmllib
> ERROR:planet.runner:Error processing https://users.puzzling.org/users/mary/venus-test/rss.xml
> ERROR:planet.runner:HTMLParseError: malformed start tag, at line 4, column 55
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/spider.py", line 437, in spiderPlanet
>     data = feedparser.parse(feed, **options)
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 3525, in parse
>     feedparser.feed(data)
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1662, in feed
>     sgmllib.SGMLParser.feed(self, data)
> ERROR:planet.runner: File "/usr/lib/python2.6/sgmllib.py", line 104, in feed
>     self.goahead(0)
> ERROR:planet.runner: File "/usr/lib/python2.6/sgmllib.py", line 143, in goahead
>     k = self.parse_endtag(i)
> ERROR:planet.runner: File "/usr/lib/python2.6/sgmllib.py", line 320, in parse_endtag
>     self.finish_endtag(tag)
> ERROR:planet.runner: File "/usr/lib/python2.6/sgmllib.py", line 360, in finish_endtag
>     self.unknown_endtag(tag)
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 569, in unknown_endtag
>     method()
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1512, in _end_content
>     value = self.popContent('content')
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 849, in popContent
>     value = self.pop(tag)
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 764, in pop
>     mfresults = _parseMicroformats(output, self.baseuri, self.encoding)
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 2218, in _parseMicroformats
>     p = _MicroformatsParser(htmlSource, baseURI, encoding)
> ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1823, in __init__
>     self.document = BeautifulSoup.BeautifulSoup(data)
> ERROR:planet.runner: File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1499, in __init__
>     BeautifulStoneSoup.__init__(self, *args, **kwargs)
> ERROR:planet.runner: File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1230, in __init__
>     self._feed(isHTML=isHTML)
> ERROR:planet.runner: File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1263, in _feed
>     self.builder.feed(markup)
> ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
>     self.goahead(0)
> ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
>     k = self.parse_starttag(i)
> ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
>     endpos = self.check_for_whole_start_tag(i)
> ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
>     self.error("malformed start tag")
> ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
>     raise HTMLParseError(message, self.getpos())
>
> System details:
> - Ubuntu 9.04
> - Python 2.6.2 (Ubuntu package 2.6.2-0ubuntu1)
> - Venus trunk revno 113, which seems to be the latest
>
> -Mary
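
The quoted traceback is long, and what it mainly shows is the order in which modules hand off to each other before HTMLParseError is raised. A minimal sketch of how that chain could be summarized from the logged lines follows. It is not part of Venus or feedparser: the helper names call_chain and module_order are hypothetical, and the regular expression assumes frame lines look exactly like the 'File "...", line N, in func' lines quoted above.

```python
import os
import re

# Assumed frame-line shape, taken from the quoted log, e.g.
#   ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\w+)')


def call_chain(log_text):
    """Return (file name, line number, function) per frame, outermost first."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append((
            os.path.basename(match.group("path")),
            int(match.group("line")),
            match.group("func"),
        ))
    return frames


def module_order(frames):
    """Collapse consecutive frames from the same file into a single entry."""
    order = []
    for filename, _line, _func in frames:
        if not order or order[-1] != filename:
            order.append(filename)
    return order


# A few frame lines copied from the quoted traceback (not the full log).
sample = '''
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/spider.py", line 437, in spiderPlanet
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 2218, in _parseMicroformats
ERROR:planet.runner: File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1263, in _feed
ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
'''

frames = call_chain(sample)
print(module_order(frames))
```

Run on the excerpt, this lists spider.py, feedparser.py, BeautifulSoup.py and HTMLParser.py in that order, which matches the reading of the log: the exception is raised in HTMLParser, reached through BeautifulSoup from feedparser's _parseMicroformats.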