Venus: parse errors on many feeds
Mary Gardiner
mary at puzzling.org
Fri Jun 19 12:33:55 EST 2009
I am getting parse errors on many feeds, in a way that suggests Feed Parser is
failing. For an example, see the .ini file at
https://users.puzzling.org/users/mary/venus-test/test.ini which pulls in the
feed at https://users.puzzling.org/users/mary/venus-test/rss.xml (a copy of the
feed originally at http://blog.gingertech.net/feed/ ).
Note that if I use feedparser directly, it has no problem with the file:
$ pwd
/home/mary/src/venus/trunk/planet/vendor
$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> feedparser.__file__
'feedparser.pyc'
>>> feedparser.parse('https://users.puzzling.org/users/mary/venus-test/rss.xml')
{'feed': {'lastbuilddate': u'Sun, 14 Jun 2009 14:15:54 +0000', 'subtitle': u"Silvia's blog" ...
However, if I run /home/mary/src/venus/trunk/planet.py test.ini, I get an
HTMLParseError raised from within feedparser:
$ /home/mary/src/venus/trunk/planet.py test.ini
/home/mary/src/venus/trunk/planet/reconstitute.py:16: DeprecationWarning: the md5 module is deprecated; use hashlib instead
import re, time, md5, sgmllib
ERROR:planet.runner:Error processing https://users.puzzling.org/users/mary/venus-test/rss.xml
ERROR:planet.runner:HTMLParseError: malformed start tag, at line 4, column 55
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/spider.py", line 437, in spiderPlanet
data = feedparser.parse(feed, **options)
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 3525, in parse
feedparser.feed(data)
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1662, in feed
sgmllib.SGMLParser.feed(self, data)
ERROR:planet.runner: File "/usr/lib/python2.6/sgmllib.py", line 104, in feed
self.goahead(0)
ERROR:planet.runner: File "/usr/lib/python2.6/sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
ERROR:planet.runner: File "/usr/lib/python2.6/sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
ERROR:planet.runner: File "/usr/lib/python2.6/sgmllib.py", line 360, in finish_endtag
self.unknown_endtag(tag)
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 569, in unknown_endtag
method()
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1512, in _end_content
value = self.popContent('content')
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 849, in popContent
value = self.pop(tag)
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 764, in pop
mfresults = _parseMicroformats(output, self.baseuri, self.encoding)
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 2218, in _parseMicroformats
p = _MicroformatsParser(htmlSource, baseURI, encoding)
ERROR:planet.runner: File "/home/mary/src/venus/trunk/planet/vendor/feedparser.py", line 1823, in __init__
self.document = BeautifulSoup.BeautifulSoup(data)
ERROR:planet.runner: File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
ERROR:planet.runner: File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
ERROR:planet.runner: File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
self.error("malformed start tag")
ERROR:planet.runner: File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
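One way to compare the two runs above is to check which file each relevant
module is loaded from: the interactive session imports feedparser from
planet/vendor, while the traceback shows BeautifulSoup coming from
/var/lib/python-support. The following is a minimal sketch and not part of
Venus. module_origins is a hypothetical helper, and it relies on
importlib.util.find_spec, so it assumes a current Python 3 rather than the
Python 2.6 in this report (there, importing each module and printing its
__file__, as in the session above, gives the same information).

```python
import importlib.util


def module_origins(names):
    """Map each module name to the file it would be imported from.

    The value is None when the module cannot be found on the current path.
    """
    origins = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Lookup can raise for broken or partially initialised modules.
            spec = None
        origins[name] = spec.origin if spec is not None else None
    return origins


if __name__ == "__main__":
    # Module names taken from the traceback in this report.
    wanted = ["feedparser", "BeautifulSoup", "sgmllib", "HTMLParser"]
    for name, origin in sorted(module_origins(wanted).items()):
        print("%-15s %s" % (name, origin))
```

Running it once from planet/vendor and once from the directory where planet.py
is started may show whether the two runs resolve the same modules.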
System details:
- Ubuntu 9.04
- Python 2.6.2 (Ubuntu package 2.6.2-0ubuntu1)
- Venus trunk revno 113, which seems to be the latest
-Mary