[Devel] Parsing dates of Xanga entries

Eliot Landrum eliot at landrum.cx
Wed Jun 15 03:04:11 EST 2005


Minh -

Even though I already talked to you on Jabber about this, I wanted to 
post this here for posterity.

I found a script by Michael Greene that cleans up Xanga's RSS a bit. I 
added pubDate support and something closer to a real title. So, 
basically, instead of:

<title>6/13/2005 5:25:33 PM</title>
<description>Well, it's over.&nbsp; Michael Jackson was found not guilty 
on all 10 counts.</description>

you'll get:

<pubDate>Mon, 13 Jun 2005 17:25:33 -0400</pubDate>
<title>Well, it's over.  Michael...</title>
<description>Well, it's over.  Michael Jackson was found not guilty on 
all 10 counts.</description>

Planet seems to be pretty happy with that change!
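
For anyone who would rather do the same two steps in Python (Planet's 
own language), here is a rough sketch of what happens to each item. It 
is not part of Michael's script: the function names are made up for 
illustration, and the fixed -4 hour offset is an assumption picked only 
because it reproduces the example above (the Xanga title carries no 
time zone of its own).

```python
import html
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def xanga_title_to_pubdate(title, utc_offset_hours=-4):
    """Turn a Xanga title such as '6/13/2005 5:25:33 PM' into an RFC 822 date.

    utc_offset_hours is an assumption: the title itself has no time zone.
    """
    naive = datetime.strptime(title.strip(), "%m/%d/%Y %I:%M:%S %p")
    aware = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return format_datetime(aware)


def description_to_title(description, length=25):
    """Build a short title from the first `length` characters of the text."""
    # Decode entities first (as the PHP script does), then drop any tags.
    text = html.unescape(description).replace("\u00a0", " ")
    text = re.sub(r"<[^>]*>", "", text)
    return text[:length] + "..."


print(xanga_title_to_pubdate("6/13/2005 5:25:33 PM"))
print(description_to_title("Well, it's over.&nbsp; Michael Jackson was found not guilty"))
```

With the sample entry above, this should print the same pubDate and 
title shown in the "after" snippet.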

So, just point Planet at the script on your web server, like this:
http://host/xanga.php?username=xxxx
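
If you syndicate a lot of Xanga blogs, the feed URLs could be generated 
rather than typed by hand. A minimal sketch, assuming the script lives 
at the placeholder address above; proxy_feed_url is a hypothetical 
helper, not something Planet provides:

```python
from urllib.parse import urlencode


def proxy_feed_url(script_url, username):
    """Return the URL Planet should fetch for one Xanga user."""
    # urlencode() escapes anything in the username that is not URL-safe.
    return script_url + "?" + urlencode({"username": username})


print(proxy_feed_url("http://host/xanga.php", "xxxx"))
```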

Eliot

Minh Nguyen wrote:

> The Planet that I unofficially maintain for my high school, called 
> Planet Xavier [1], syndicates a large number of blogs hosted by a 
> service called Xanga [2]. Xanga only provides feeds in RSS 0.91 form, 
> and includes post dates in the <title> element, using the following 
> format, in Eastern Standard Time:
>
> mm/dd/yyyy hh:mm:ss AM
>
> I'd like for Planet to parse the provided date for each post, but I'm 
> new to Python, and the sheer length of the feedparser.py script is 
> preventing me from seeing how I'd be able to do this. From my cursory 
> reading of the code, I'm pretty sure that the script can't parse the 
> date format, but that's something I can figure out. I'm just not sure 
> how to get Planet to read the <title> element instead of the <date> 
> element. If anyone can help me out, I'd appreciate that very much.
>
> [1] http://mxn.f2o.org/planet/xavier/
> [2] http://www.xanga.com/

-------------- next part --------------
<?php

/*
   Xanga Feed Converter 0.4
   (C) 2004 Michael Greene, michael dot greene at gmail dot com
   Published under the MIT License
   Revision Date: 2004.10.19
*/

/* 
   Added pubDate support
   Eliot Landrum <eliot at landrum.cx>
   June 14, 2005
*/  

// Some aggregators don't like it if we don't tell them
// *exactly* what we're sending
header('Content-Type: text/xml; charset=utf-8');

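// Get the username and form the URL for the Xanga feed.
// urlencode() keeps stray characters from altering the request.
$username = isset($_GET['username']) ? $_GET['username'] : '';
$feed = 'http://www.xanga.com/rss.aspx?user=' . urlencode($username);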

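// Grab the Xanga feed.
// Note: on PHP newer than 4.3.4 this ini_set() has no effect and
// allow_url_fopen must be enabled in php.ini instead.
ini_set('allow_url_fopen', true);
$fp = fopen($feed, 'r');
if ($fp === false) {
  // Bail out if the feed could not be opened
  exit;
}
$xml = '';
while (!feof($fp)) {
  $xml .= fread($fp, 128);
}
fclose($fp);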

// Convert the malformed Xanga feed into a near-valid RSS feed
$xml = str_replace('</channel><item>','<item>',$xml);
$xml = str_replace('</rss>','</channel></rss>',$xml);

// Xanga escapes non-breaking spaces (sometimes twice, it seems);
// turn either form into an ordinary space
$xml = str_replace('&amp;nbsp;',' ', $xml);

$xml = str_replace('&nbsp;',' ', $xml);

//$xml = str_replace('&lt;','<', $xml);

//$xml = str_replace('<br>','<br/>', $xml);


// Break the feed into <item> components
$items = explode('<item>', $xml);

// Add a description element to the channel for validity
$items[0] .= '<description>An RSS Feed of a Xanga.com Journal</description>';

// Make the titles a lot cooler (every item, however many the feed has)
for ($i = 1; $i < count($items); $i++) {
  // Grab the description part
  $description = strstr($items[$i], '<description>');
  // Get rid of any HTML
  $description = html_entity_decode($description);
  $description = strip_tags($description);
  // Extract the first 25 characters of that
  $extra = substr($description, 0, 25);
  // Separate the title from the rest
  $itempieces = explode('</title>', $items[$i]);
  // Take out the title tag
  $itempieces[0] = str_replace('<title>','', $itempieces[0]);
  // Take the contents of what was the title and convert it to an RFC-822 date
  // Then add it as a pubDate element
  $itempieces[0] = '<pubDate>' . date("r",strtotime($itempieces[0])) . '</pubDate>';
  // Add the pubDate plus the first 25 characters of the content as the title,
  // re-escaped because $extra was entity-decoded above
  $itempieces[0] = $itempieces[0] . '<title>' . htmlspecialchars($extra) . '...';
  // Put the pieces of the item back together
  $items[$i] = implode('</title>', $itempieces);
}

// Put the items back together
$xml = implode('<item>', $items);

// Output the final RSS feed
echo $xml;

?>

