[Devel] encoding fix for planet
Gediminas Paulauskas
menesis at delfi.lt
Tue Feb 24 04:39:22 EST 2004
I downloaded planet-devel.tar.bz2 and got it running successfully for a
group of friends. Along the way I found a few problems.
1. Most of the feeds were in windows-1257 encoding, which is declared in
the XML header, like <?xml version="1.0" encoding="1257"?> . Planet did not
convert and display them correctly. I modified planetlib.py to fall back to
the declared encoding instead of the hardcoded iso-8859-1, and it works well
now. The patch planet-encoding.diff contains this fix.
2. It would be good to add this line to <head> in examples/index.html.tmpl,
so that browsers know which charset to choose.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
3. The example feed templates do not escape links everywhere they need to,
so the generated XML files are invalid. The patch planet-feeds.diff fixes that.
Test with http://pukomuko.esu.lt/index.php?page=index&item=rss
4. The INSTALL file refers to pyblagg.py; it should be planet.py:
Run it: python pyblagg.py pathto/config.ini
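For illustration, the fallback described in point 1 could be sketched in present-day Python roughly as below. This is not the planetlib.py code; the names to_utf8 and XML_DECL_RE are made up for the example, and it assumes the feed arrives as raw bytes. It also catches LookupError, on the assumption that a feed might declare an encoding name Python does not know.

```python
import re

# Hypothetical helper, not Planet's code: captures the encoding attribute
# of an XML declaration at the very start of the document.
XML_DECL_RE = re.compile(
    rb'<\?xml[^>]*?encoding=[\'"]([A-Za-z0-9._-]+)[\'"][^>]*\?>'
)


def to_utf8(data: bytes, default: str = "iso8859_1") -> bytes:
    """Return data as UTF-8 bytes: try UTF-8, then the declared encoding."""
    try:
        return data.decode("utf-8").encode("utf-8")
    except UnicodeError:
        pass

    # Use the encoding from the XML header, or the default if none is given.
    match = XML_DECL_RE.match(data)
    encoding = match.group(1).decode("ascii").lower() if match else default

    try:
        return data.decode(encoding).encode("utf-8")
    except (UnicodeError, LookupError):
        # Last resort: replace every non-ASCII byte.
        return data.decode("ascii", "replace").encode("utf-8")


if __name__ == "__main__":
    feed = '<?xml version="1.0" encoding="windows-1257"?><t>ąčę</t>'.encode("cp1257")
    print(to_utf8(feed).decode("utf-8"))
```

The order matches the patch: UTF-8 first, then the declared encoding, then the lossy ASCII replacement.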
Thanks for the program, it was easy to get it going!
Gediminas Paulauskas
-------------- next part --------------
--- planet-devel/planetlib.py 2004-02-09 01:23:58.000000000 +0200
+++ planet/planetlib.py 2004-02-23 19:12:28.000000000 +0200
@@ -180,7 +180,8 @@
We then take the data and try to squeeze it into a UTF-8 string
using Python's unicode module. If it doesn't decode as UTF-8
- we try ISO-8559-1 before ruthlessly stripping the bad characters.
+ we try using encoding specified in xml header (or ISO-8859-1 if
+ not specified) before ruthlessly stripping the bad characters.
"""
data = f.read()
@@ -197,13 +198,19 @@
data = unicode(data, "utf8").encode("utf8")
logging.debug("Encoding: UTF-8")
except UnicodeError:
+ encoding = "iso8859_1"
+ xmlheaderRe = re.compile('<\?.*encoding=[\'"](.*?)[\'"].*\?>')
+ match = xmlheaderRe.match(data)
+ if match:
+ encoding = match.groups()[0].lower()
+
try:
- data = unicode(data, "iso8859_1").encode("utf8")
- logging.debug("Encoding: ISO-8859-1")
+ data = unicode(data, encoding).encode("utf8")
+ logging.debug("Encoding: " + encoding)
except UnicodeError:
data = unicode(data, "ascii", "replace").encode("utf8")
- logging.warn("Feed wasn't in UTF-8 or ISO-8859-1, replaced " +
- "all non-ASCII characters.")
+ logging.warn("Feed wasn't in UTF-8 or " + encoding +
+ ", replaced all non-ASCII characters.")
return data
-------------- next part --------------
diff -u planet-devel/examples/foafroll.xml.tmpl planet/examples/foafroll.xml.tmpl
--- planet-devel/examples/foafroll.xml.tmpl 2004-02-09 01:23:58.000000000 +0200
+++ planet/examples/foafroll.xml.tmpl 2004-02-23 16:44:00.000000000 +0200
@@ -8,7 +8,7 @@
>
<foaf:Group>
<foaf:name><TMPL_VAR name></foaf:name>
- <foaf:homepage><TMPL_VAR link></foaf:homepage>
+ <foaf:homepage><TMPL_VAR link ESCAPE="HTML"></foaf:homepage>
<rdfs:seeAlso rdf:resource="<TMPL_VAR uri ESCAPE="HTML">" />
<TMPL_LOOP Channels>
diff -u planet-devel/examples/rss10.xml.tmpl planet/examples/rss10.xml.tmpl
--- planet-devel/examples/rss10.xml.tmpl 2004-02-09 01:23:58.000000000 +0200
+++ planet/examples/rss10.xml.tmpl 2004-02-23 16:44:02.000000000 +0200
@@ -8,8 +8,8 @@
>
<channel rdf:about="<TMPL_VAR link ESCAPE="HTML">">
<title><TMPL_VAR name></title>
- <link><TMPL_VAR link></link>
- <description><TMPL_VAR name> - <TMPL_VAR link></description>
+ <link><TMPL_VAR link ESCAPE="HTML"></link>
+ <description><TMPL_VAR name> - <TMPL_VAR link ESCAPE="HTML"></description>
<items>
<rdf:Seq>
@@ -23,7 +23,7 @@
<TMPL_LOOP Items>
<item rdf:about="<TMPL_VAR id ESCAPE="HTML">">
<title><TMPL_VAR channel_name><TMPL_IF title>: <TMPL_VAR title></TMPL_IF></title>
- <link><TMPL_VAR link></link>
+ <link><TMPL_VAR link ESCAPE="HTML"></link>
<TMPL_IF content>
<content:encoded><TMPL_VAR content ESCAPE="HTML"></content:encoded>
</TMPL_IF>
diff -u planet-devel/examples/rss20.xml.tmpl planet/examples/rss20.xml.tmpl
--- planet-devel/examples/rss20.xml.tmpl 2004-02-09 01:23:58.000000000 +0200
+++ planet/examples/rss20.xml.tmpl 2004-02-23 16:45:14.000000000 +0200
@@ -2,15 +2,15 @@
<rss version="2.0">
<channel>
<title><TMPL_VAR name></title>
- <link><TMPL_VAR link></link>
+ <link><TMPL_VAR link ESCAPE="HTML"></link>
<language>en</language>
- <description><TMPL_VAR name> - <TMPL_VAR link></description>
+ <description><TMPL_VAR name> - <TMPL_VAR link ESCAPE="HTML"></description>
<TMPL_LOOP Items>
<item>
<title><TMPL_VAR channel_name><TMPL_IF title>: <TMPL_VAR title></TMPL_IF></title>
- <guid><TMPL_VAR id></guid>
- <link><TMPL_VAR link></link>
+ <guid><TMPL_VAR id ESCAPE="HTML"></guid>
+ <link><TMPL_VAR link ESCAPE="HTML"></link>
<TMPL_IF content>
<description><TMPL_VAR content ESCAPE="HTML"></description>
</TMPL_IF>
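To illustrate why the links in these templates need ESCAPE="HTML", here is a small Python sketch using the test URL from the message. It assumes that ESCAPE="HTML" replaces characters such as & with entities, which is what escape() from the standard library does here; link_element and is_well_formed are hypothetical helpers, not part of Planet.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

URL = "http://pukomuko.esu.lt/index.php?page=index&item=rss"


def link_element(url, escaped=True):
    """Build a <link> element as a template would, with or without escaping."""
    text = escape(url) if escaped else url
    return "<link>" + text + "</link>"


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # The bare "&" in the query string breaks the parser unless it is escaped.
    print(is_well_formed(link_element(URL, escaped=False)))
    print(is_well_formed(link_element(URL)))
```

A parser reading the escaped element gets the original URL back, so readers of the feed see the link unchanged.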