with Lorelle and Brent VanFossen

Imitating MovableType Import File

Once I had searched for and replaced the parts of each page that could be eliminated, I needed to start the next series of search and replaces, which would turn the information into the MovableType format. Following the guidelines set up in the article MoveableType Instructions for Importing Data, I had a format to follow and a lot of work ahead of me to squish my static HTML files into MovableType entries.

The format features “fields” of data separated by dashed lines. Records (pages) are divided from each other by a line of eight dashes followed by a line break code, like this:

--------\n

Between each field of the record there needs to be a line of five dashes and a line break code:

-----\n

Not every bit of information is cleanly separated. Some of it is grouped together with related bits of information into a single “field”. The first group of information looks like this:

--------\n
-----\n
TITLE: Foo Bar
AUTHOR: Bar Foo
DATE: 11/25/2004 03:31:05 PM
PRIMARY CATEGORY: Fruit
CATEGORY: Apple
-----\n

The Order of the Import File Data

At this point, I started questioning the order of the information in the import file. If the import-mt.php file could recognize the labels TITLE, AUTHOR, DATE, etc., then wouldn’t it just grab the correct information and stick it in the right database field, whatever order the information came in? If I had it as AUTHOR, DATE, TITLE, etc., would it care?

Well, I did a couple of tests and found out that the import-mt.php file from WordPress does care. It cares VERY precisely. The order has to be:

--------\n
TITLE
AUTHOR
DATE
PRIMARY CATEGORY
CATEGORY
-----\n
BODY
-----\n
EXTENDED BODY
-----\n
EXCERPT
-----\n
COMMENT
AUTHOR
DATE
IP
EMAIL
This is the body of this comment.
-----\n
COMMENT
AUTHOR
DATE
IP
EMAIL
This is the body of another comment. It goes up to here.
-----\n
PING
TITLE
URL
IP
BLOG NAME
DATE
Pinged entry text goes here.
-----\n
--------\n

While the order is critical, if the item is blank, it can be left blank or just removed and it will still work. It just has to be in that order, or you end up getting your Date information in your Title, etc.
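For anyone who would rather generate the file with a script, here is a minimal Python sketch of the ordering rule. It is only an illustration, not how import-mt.php works internally. The function name `build_record` and the dictionary layout are made up for this example, the `BODY:` and `EXCERPT:` labels follow the colon style used later in this article, and the eight-dash line is written at the end of each record (add one at the top of the file if you follow the listing above exactly).

```python
# Hypothetical helper: write one record with its fields in the required order.
METADATA_ORDER = ["TITLE", "AUTHOR", "DATE", "PRIMARY CATEGORY", "CATEGORY"]
SECTION_ORDER = ["BODY", "EXTENDED BODY", "EXCERPT"]

FIELD_END = "-----\n"      # five dashes close a field
RECORD_END = "--------\n"  # eight dashes close a record (page)


def build_record(metadata, sections):
    """Return one record as text, no matter what order the dicts were filled in."""
    lines = []
    for key in METADATA_ORDER:
        value = metadata.get(key, "")
        if value:  # blank items can simply be left out
            lines.append(f"{key}: {value}\n")
    lines.append(FIELD_END)
    for key in SECTION_ORDER:
        text = sections.get(key, "")
        if text:
            lines.append(f"{key}:\n{text}\n")
            lines.append(FIELD_END)
    lines.append(RECORD_END)
    return "".join(lines)
```

Because the order lives in the two lists rather than in the data, a page whose details were collected as DATE, TITLE, AUTHOR still comes out as TITLE, AUTHOR, DATE.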

This conceptual process had started out simple and easy – find the code and replace it with nothing to erase it, or replace it with the new code that I needed for the import. Now I had to consider the order of the data and how to minimize manual editing. It was time to become a WordPress Importing Detective.

Always think five steps ahead
but prepare to take two steps backwards.
My motto for this endeavor.

Enter the Importing Detective

Looking over my remaining data from my static HTML pages and the structure of the import file layout, I worked to match related material. Part of it was easy.

I knew I would be importing ALL of these pages into a single document, so I needed to divide up each one with the beginning and ending tags as used in the MovableType import file. Every HTML document begins with an <html> tag. While I’d cleaned out the document type codes, I still had the HTML tag in place, thinking ahead. So I searched and replaced the HTML tag across the multiple files with:

--------\n

The dashed lines after the post/article content, called the Body in MovableType, also turned out to be easy. Since I knew the code for separating records (pages) is the eight-dash line, I could replace each page’s closing HTML codes:

</body>
</html>

With:

-----\n
--------\n

I knew there would still be leftover code between the record start line and the Body data, and again between the end of the Body and the closing field and record lines, but I’d deal with that in the manual edits.
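The two replacements above can also be sketched in Python. This is a rough equivalent under a few assumptions: the function name `mark_record_boundaries` is hypothetical, each page is already loaded as a string, and regular expressions are used so that attributes on the opening tag and line breaks between the closing tags are tolerated. A plain text search and replace utility, as used in this article, does the same job.

```python
import re

RECORD_LINE = "--------\n"  # eight dashes: record (page) boundary
FIELD_LINE = "-----\n"      # five dashes: field boundary


def mark_record_boundaries(html):
    """Swap the <html> tag and the closing </body></html> pair for dashed lines."""
    # The opening <html ...> tag (with any attributes) becomes the record start line.
    html = re.sub(r"<html[^>]*>\s*", RECORD_LINE, html, count=1, flags=re.IGNORECASE)
    # The closing tags, even when on separate lines, become field end + record end.
    html = re.sub(
        r"</body>\s*</html>\s*",
        FIELD_LINE + RECORD_LINE,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    return html
```

Running this over every file in a folder is then a short loop that reads each file, calls the function, and writes the result back.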

The elements I had clearly defined during my first multiple file search and replaces were the author and body. The title was a little more difficult.

If your page <title> is the same as the actual title of your page, you don’t have a problem. Search and replace just got very easy. The problem for me is that my <title> tag contains a combination of the article series title and the title of the article. So instead of just being “Validating Your Website”, the HTML page <title> is “Website Development – Validating Your Website”. Within the actual body of the article, I keep the article series title in H1 and the article title in H2. WordPress sees things differently, as does the import file. This is another reason why it is so important to understand how WordPress interprets different bits of information compared to your site: you need to know where to make your compromises, pushes, and shoves to turn your material into WordPress material.

I debated over this for a while, trying to figure out which would be smarter: to use the web page title or the article title in the H2 tags. I finally decided to keep my article title in the H2 heading and deal with the article series within WordPress categories. I now needed to figure out how to change my H2 tag to match the needs of the import.

In my HTML, I began with this:

<h2>Start With Compliance</h2>

The first search and replace changed it to:

-----\n
TITLE: Start With Compliance</h2>

This meant I had the start of the first field of information, along with the TITLE: tag. Yes, there is still code and junk between the beginning “record” line and the Title tag, but this was a start to form the final file.
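As a side note, the heading step can be sketched in Python with a regular expression. Be aware that this is a different technique from the plain search and replace described here: it captures the heading text and drops both the opening and closing tags in one pass, so the closing tag is no longer available as a marker afterwards. The function name `h2_to_title` is hypothetical, and the sketch assumes only the first H2 on the page is the article title.

```python
import re

# DOTALL lets the pattern match headings that contain line breaks.
H2_PATTERN = re.compile(r"<h2>(.*?)</h2>", re.IGNORECASE | re.DOTALL)


def h2_to_title(html):
    """Turn the first <h2>...</h2> into a five-dash line plus a TITLE: line."""
    def replace(match):
        # Collapse line breaks and repeated spaces inside the heading.
        title = " ".join(match.group(1).split())
        return f"-----\nTITLE: {title}"

    return H2_PATTERN.sub(replace, html, count=1)
```

The whitespace cleanup matters for headings that wrap across two lines in the source file, which would otherwise leave a broken TITLE line.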

Now, if I hadn’t been thinking ahead, I would have simply done a search and replace for the closing tag:

</h2>

and replaced it with nothing, since that would match what I wanted (which I did, and then realized my mistake and took a step backwards), but I had to think ahead and do some serious planning.

The rest of the field included the Author, Date, Primary Category, and Category, and then a field end and then the BODY. Since my HTML files don’t contain the Date, Primary Category, or Category, I needed to put in placeholders and manually add the information later. I also needed to put them in the right “place” in the order of the import. Trying to find another unique identifier such as the closing tag of h2 meant serious work.

The AUTHOR follows the TITLE, and in my HTML, that was also the case. So what I needed to do was search and replace the AUTHOR tags to change them to the MovableType format, and then add the import labels for the Date, Primary Category, Category, and Body as placeholders, to be filled in during my manual edit time.

I decided I would first replace the author tags, before doing the title tags, so I would still have the closing H2 tag to clean up the title tags, if needed.

On our site, we have three authors: Brent VanFossen, Lorelle VanFossen, and Lorelle and Brent VanFossen. This meant I had to do the search and replace three times to cover the three name combinations. Using the first one, I did a search for:

<div id="author">By Brent VanFossen</div>

Replacing it with:

AUTHOR: Brent VanFossen
DATE:
PRIMARY CATEGORY:
CATEGORY:
-----\n
BODY:

Since that worked, I did another search and replace to get rid of the closing H2 tag (</h2>). Ah, success by the WordPress Importing Detective!
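The three separate author replacements could be folded into one if your tool supports pattern matching. Here is a minimal Python sketch of that idea. It assumes the author line always has the exact form `<div id="author">By NAME</div>` shown above, and the function name `author_to_fields` is made up for this example.

```python
import re

# Capture whatever name sits between "By " and the closing </div>.
AUTHOR_PATTERN = re.compile(r'<div id="author">By (.*?)</div>')

# Empty placeholders to be filled in during the manual edit.
PLACEHOLDERS = (
    "AUTHOR: {name}\n"
    "DATE:\n"
    "PRIMARY CATEGORY:\n"
    "CATEGORY:\n"
    "-----\n"
    "BODY:\n"
)


def author_to_fields(html):
    """Replace the author <div> with the AUTHOR line and the placeholder fields."""
    def replace(match):
        return PLACEHOLDERS.format(name=match.group(1).strip())

    return AUTHOR_PATTERN.sub(replace, html, count=1)
```

One pattern covers all three name combinations, since the name is copied from the page rather than typed into the replacement.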

Before celebrating, though, I soon realized I would have to go through manually and delete the <TITLE> and H1 information. I will have a good bit of manual cleaning to do as I sometimes have line breaks in my <h2> tags, and there will be a bit of unwanted but unique code that will have to be removed manually, but I have an idea on how to expedite this process…later.

Last Bits of Multiple File Search and Replaces

I now had only a few bits of information that I could realistically do over multiple file search and replaces. Much of the rest of the data would have to be inspected manually.

Since I don’t have any comments or pings, that part of the import could be ignored. But there was more to deal with.

The Excerpt turned out to not be simple. My static HTML files didn’t have excerpts. I wrote them myself on the static category and information pages. With the move to dynamically generated web pages, an Excerpt would have more power.

I decided that I would make an attempt to at least have some kind of Excerpt by using the meta tag “description” for each web page. Since this information was still lying in the multiple test files, awaiting manual editing because it was unique to every page, I figured I could use it until I changed my mind.

The tag for the Excerpt just needed replacing at the beginning:

<meta name="description" content="

with:

-----\n
EXCERPT:

The problem was that the ending of the tag wasn’t anything special: just a quote, a space, and />. If I searched for and replaced that, I would make a mess of all of my self-closing XHTML tags. I needed to find something unique to match that ending, or manually delete it during the edit phase.

The unique match was the keywords meta tag that still remained below the description meta tag. This is one of the reasons why a search and replace program or utility needs to be able to match across multiple lines. I did a search for:

" />
<meta name="keywords"

and replaced it with:

-----\n
<meta name="keywords"

I decided to put the “meta name” keywords information back in as part of the replacement, as I was still undecided about how to handle the keywords, and whether WordPress would even have a place for them in its database. More stuff to research and learn.
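The multi-line match is the key trick in this step, so here is a minimal Python sketch of it. It assumes the keywords meta tag sits directly below the description meta tag, as it does in my files. The function name `description_to_excerpt` is hypothetical, and placing the excerpt text on the line after `EXCERPT:` is my own layout choice for the sketch.

```python
import re

# Match the description tag only when the keywords tag follows it.
# The lookahead (?=...) checks for the keywords tag without consuming it,
# so that tag is left in place for later decisions.
DESCRIPTION_PATTERN = re.compile(
    r'<meta name="description" content="(.*?)" />\s*(?=<meta name="keywords")',
    re.DOTALL,
)


def description_to_excerpt(html):
    """Turn the description meta tag into an EXCERPT field."""
    def replace(match):
        return f"-----\nEXCERPT:\n{match.group(1).strip()}\n-----\n"

    return DESCRIPTION_PATTERN.sub(replace, html, count=1)
```

Because the pattern is tied to the keywords tag, other self-closing tags ending in the same quote, space, and /> are left alone.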

With the core elements in place, or close to it, I took another look at what elements I could search and replace to eliminate. I found some Amazon ads that I would replace in the template files, JavaScript code I wouldn’t be using any more, and some stray bits of redundant code, and actually managed to shrink my former static HTML web pages to mostly the content needed to go into the database via the import file. Using the search and replace method across multiple files cleaned things up.

It was now time to bring my test HTML files into a single document in preparation for the final cleaning and importing into WordPress. The hard work was just beginning.

One Comment

  • Posted November 18, 2005 at 6:29 | Permalink

    Sorry for invasion..
    There are program, named RQ Search and Replace.
    “RQ Search and Replace is easy to use batch text replacement utility for Windows…”
    One of the its feature is the special HTML tags processing. Program can parse HTML code and process HTML tags only, not other text.
    For HTML tags next operations are available:
    Delete tag – deleting all tags that contains “search pattern”
    You can delete start tag, corresponding end tag and/or/nor all text (and other tags) enclosed in this tag.
    Also you can
    Change attribute
    Delete attribute
    Add attribute – Add the attribute with given value and name.
    etc..
    Of course other “objects” – text blocks, strings “enclosed blocks” can be processed too.
    Posiible using this program helps to convert files more easy.
    RQSR link – http://mira.home.line1.ru/rqsr.html

    WBR, Andrew.

    ps. sorry fo my English.
