[Devel] encoding fix for planet

Gediminas Paulauskas menesis at delfi.lt
Tue Feb 24 04:39:22 EST 2004


I downloaded planet-devel.tar.bz2 and got it running successfully for a 
group of friends. I found a few problems.

1. Most of the feeds are in windows-1257 encoding, which is declared in 
the XML header, like <?xml version="1.0" encoding="1257"?> . Planet did 
not convert them, so they were displayed incorrectly. I modified 
planetlib.py to use the declared encoding instead of the hardcoded 
ISO-8859-1 fallback, and it works well now. The patch 
planet-encoding.diff contains this fix.

2. In examples/index.html.tmpl it would be good to add this line in 
<head>, so that browsers know which charset to use.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

3. The example feed templates do not escape links in all the places that 
need it, so the generated XML files are invalid whenever a link contains 
a bare & character. The patch planet-feeds.diff fixes that. 
Test with http://pukomuko.esu.lt/index.php?page=index&item=rss

4. The INSTALL file refers to pyblagg.py; it should be planet.py:

     Run it: python pyblagg.py pathto/config.ini
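
To illustrate point 1, here is a minimal standalone sketch of the same 
idea: try UTF-8 first, then the encoding declared in the XML header, then 
fall back to replacing bad characters. It is written for Python 3 rather 
than the Python 2 of planetlib.py, and the names to_utf8 and XML_DECL_RE 
are made up for this example; it is not the code from the patch.

```python
import codecs
import re

# Hypothetical helper for illustration; not part of planetlib.py.
# Matches the encoding attribute of an XML declaration at the start of the data.
XML_DECL_RE = re.compile(rb'<\?xml[^>]*encoding=[\'"]([A-Za-z0-9._-]+)[\'"]')


def to_utf8(data, default="iso8859_1"):
    """Return the feed bytes re-encoded as UTF-8."""
    # 1. If the data already decodes as UTF-8, keep it as it is.
    try:
        return data.decode("utf-8").encode("utf-8")
    except UnicodeDecodeError:
        pass

    # 2. Otherwise use the encoding declared in the XML header, if Python
    #    knows a codec by that name; else keep the default.
    encoding = default
    match = XML_DECL_RE.match(data)
    if match:
        declared = match.group(1).decode("ascii").lower()
        try:
            codecs.lookup(declared)
            encoding = declared
        except LookupError:
            pass

    # 3. Decode with that encoding; as a last resort, replace every
    #    non-ASCII byte so that something displayable is still returned.
    try:
        return data.decode(encoding).encode("utf-8")
    except UnicodeDecodeError:
        return data.decode("ascii", "replace").encode("utf-8")


feed = '<?xml version="1.0" encoding="windows-1257"?><t>ąčę</t>'.encode("cp1257")
print(to_utf8(feed).decode("utf-8"))
```

Checking the declared name with codecs.lookup first means an unknown or 
misspelled encoding falls back to the default instead of raising an error.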

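To illustrate point 3, this small sketch shows why the test link above 
breaks an unescaped template. It uses xml.sax.saxutils.escape from the 
Python 3 standard library as a stand-in for the template's ESCAPE="HTML" 
option, and the helper is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

link = "http://pukomuko.esu.lt/index.php?page=index&item=rss"


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The bare "&" in the query string starts an entity reference that never ends.
raw = "<link>%s</link>" % link
# After escaping, "&" becomes "&amp;" and the element parses.
escaped = "<link>%s</link>" % escape(link)

print(is_well_formed(raw))      # False
print(is_well_formed(escaped))  # True
```

A parser reading the escaped element gets the original link back, so 
escaping does not change what feed readers see.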

Thanks for the program, it was easy to get going!

Gediminas Paulauskas

-------------- next part --------------
--- planet-devel/planetlib.py	2004-02-09 01:23:58.000000000 +0200
+++ planet/planetlib.py	2004-02-23 19:12:28.000000000 +0200
@@ -180,7 +180,8 @@
 
         We then take the data and try to squeeze it into a UTF-8 string
         using Python's unicode module.  If it doesn't decode as UTF-8
-        we try ISO-8559-1 before ruthlessly stripping the bad characters.
+        we try using encoding specified in xml header (or ISO-8859-1 if
+        not specified) before ruthlessly stripping the bad characters.
         """
         data = f.read()
 
@@ -197,13 +198,19 @@
             data = unicode(data, "utf8").encode("utf8")
             logging.debug("Encoding: UTF-8")
         except UnicodeError:
+            encoding = "iso8859_1"
+            xmlheaderRe = re.compile('<\?.*encoding=[\'"](.*?)[\'"].*\?>')
+            match = xmlheaderRe.match(data)
+            if match:
+                encoding = match.groups()[0].lower()
+
             try:
-                data = unicode(data, "iso8859_1").encode("utf8")
-                logging.debug("Encoding: ISO-8859-1")
+                data = unicode(data, encoding).encode("utf8")
+                logging.debug("Encoding: " + encoding)
             except UnicodeError:
                 data = unicode(data, "ascii", "replace").encode("utf8")
-                logging.warn("Feed wasn't in UTF-8 or ISO-8859-1, replaced " +
-                             "all non-ASCII characters.")
+                logging.warn("Feed wasn't in UTF-8 or " + encoding +
+                             ", replaced all non-ASCII characters.")
 
         return data
 
-------------- next part --------------
diff -u planet-devel/examples/foafroll.xml.tmpl planet/examples/foafroll.xml.tmpl
--- planet-devel/examples/foafroll.xml.tmpl	2004-02-09 01:23:58.000000000 +0200
+++ planet/examples/foafroll.xml.tmpl	2004-02-23 16:44:00.000000000 +0200
@@ -8,7 +8,7 @@
 >
 <foaf:Group>
 	<foaf:name><TMPL_VAR name></foaf:name>
-	<foaf:homepage><TMPL_VAR link></foaf:homepage>
+	<foaf:homepage><TMPL_VAR link ESCAPE="HTML"></foaf:homepage>
 	<rdfs:seeAlso rdf:resource="<TMPL_VAR uri ESCAPE="HTML">" />
 
 <TMPL_LOOP Channels>
diff -u planet-devel/examples/rss10.xml.tmpl planet/examples/rss10.xml.tmpl
--- planet-devel/examples/rss10.xml.tmpl	2004-02-09 01:23:58.000000000 +0200
+++ planet/examples/rss10.xml.tmpl	2004-02-23 16:44:02.000000000 +0200
@@ -8,8 +8,8 @@
 >
 <channel rdf:about="<TMPL_VAR link ESCAPE="HTML">">
 	<title><TMPL_VAR name></title>
-	<link><TMPL_VAR link></link>
-	<description><TMPL_VAR name> - <TMPL_VAR link></description>
+	<link><TMPL_VAR link ESCAPE="HTML"></link>
+	<description><TMPL_VAR name> - <TMPL_VAR link ESCAPE="HTML"></description>
 
 	<items>
 		<rdf:Seq>
@@ -23,7 +23,7 @@
 <TMPL_LOOP Items>
 <item rdf:about="<TMPL_VAR id ESCAPE="HTML">">
 	<title><TMPL_VAR channel_name><TMPL_IF title>: <TMPL_VAR title></TMPL_IF></title>
-	<link><TMPL_VAR link></link>
+	<link><TMPL_VAR link ESCAPE="HTML"></link>
 	<TMPL_IF content>
 	<content:encoded><TMPL_VAR content ESCAPE="HTML"></content:encoded>
 	</TMPL_IF>
diff -u planet-devel/examples/rss20.xml.tmpl planet/examples/rss20.xml.tmpl
--- planet-devel/examples/rss20.xml.tmpl	2004-02-09 01:23:58.000000000 +0200
+++ planet/examples/rss20.xml.tmpl	2004-02-23 16:45:14.000000000 +0200
@@ -2,15 +2,15 @@
 <rss version="2.0">
 <channel>
 	<title><TMPL_VAR name></title>
-	<link><TMPL_VAR link></link>
+	<link><TMPL_VAR link ESCAPE="HTML"></link>
 	<language>en</language>
-	<description><TMPL_VAR name> - <TMPL_VAR link></description>
+	<description><TMPL_VAR name> - <TMPL_VAR link ESCAPE="HTML"></description>
 
 <TMPL_LOOP Items>
 <item>
 	<title><TMPL_VAR channel_name><TMPL_IF title>: <TMPL_VAR title></TMPL_IF></title>
-	<guid><TMPL_VAR id></guid>
-	<link><TMPL_VAR link></link>
+	<guid><TMPL_VAR id ESCAPE="HTML"></guid>
+	<link><TMPL_VAR link ESCAPE="HTML"></link>
 	<TMPL_IF content>
 	<description><TMPL_VAR content ESCAPE="HTML"></description>
 	</TMPL_IF>

