with Lorelle and Brent VanFossen

Creating One Big Import File

I’d validated a lot of the HTML of my web pages, run it through a conversion from HTML to XHTML, and searched and replaced across the multiple files to eliminate the redundant styles, HTML structures, meta tags, javascripts, and unwanted elements, reducing each page down to the Title, Author, Excerpt, and Body content. What remained was a lot of blank lines and empty space.

I now needed to bring all of this information from my 65 test pages into one file to prepare it for import into the WordPress database. But I wanted to make one last cleanup of all the “spaces” in the HTML files.

Using the multiple file search and replace utility within HotDog Pro, which permits multiple line search and replace, I simply replaced three empty line breaks with two. Luckily for me, HotDog Pro goes through every file until ALL of the replacements are done, so I didn’t have to repeat the search. If you are following the bouncing Lorelle on her path, repeating her mistakes but using different search and replace software, you might need to repeat the search a couple of times.
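For anyone without HotDog Pro, here is a minimal Python sketch of the same cleanup. Python is not part of my process above, and the function name is made up for illustration. The pattern matches any run of three or more line breaks, so a single pass is enough and there is no need to repeat the search.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Replace every run of three or more line breaks with exactly two."""
    # Normalise Windows line endings so the pattern only has to handle "\n".
    text = text.replace("\r\n", "\n")
    # "\n{3,}" matches the whole run at once, however long it is.
    return re.sub(r"\n{3,}", "\n\n", text)
```

Running it on the text of each file in turn gives the same result as repeating the three-for-two replacement until nothing is left to replace.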

There, now I felt it was ready to become one big messy file instead of 65 individually messed up files.

Considering the WordPress Date

I don’t care what date an article was written or published, though I do like knowing the last time it was updated, which keeps me on track of things I need to check, especially with timely information. In WordPress, though, the date plays several key roles.

WordPress organizes all posts in chronological order. Since it is designed and developed for bloggers, date and time play a critical role in being “the first out of the gate” with information, or following the chronological “story” of a person’s journal, like a diary. Since neither of these matters to me, I needed to understand how WordPress controls the “next” and “previous” posts using dates in order to keep my article series in some kind of order.

After a bit of research, I learned that articles/posts are organized by most recent to oldest. Therefore, in an article series, the date order should be:

Article 1 March 15
Article 2 March 14
Article 3 March 13
Article 4 March 12
Article 5 March 11
Article 6 March 10

This meant that I would have to order the dates manually, from most recent date to oldest date, in order to maintain some element of sequence of articles as people moved through my site. Oh, joy. I still needed to set the date for each page in the import file. So I had to add the WordPress date issues to the things I would have to fuss over when it came time to handle the manual edits of my import file.
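The date list for a series can be generated rather than worked out by hand. This is a small Python sketch under the ordering described above (first article gets the most recent date, each later article one day earlier). The function name and the year 2005 in the example are assumptions for illustration only.

```python
from datetime import date, timedelta


def series_dates(first_article_date: date, count: int) -> list[date]:
    """Return one date per article, stepping back one day per article."""
    return [first_article_date - timedelta(days=i) for i in range(count)]


# Example: six articles, the first dated March 15 (year assumed).
for number, day in enumerate(series_dates(date(2005, 3, 15), 6), start=1):
    print(f"Article {number} {day:%B} {day.day}")
```

The one-day step is arbitrary. Any spacing works as long as the dates keep descending through the series.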

Grouped by Categories

On my static HTML site, all of my pages were grouped by category, using the category as the folder name. By working on each folder, one at a time, instead of the whole site, it was now easier to add the categories for each set of files. At least one part of this process might be easy after all. Later I can easily add subcategories for these in WordPress, but for now, they are all categorized nicely.

So I did a search and replace across the multiple files in the Learning folder for the import file title PRIMARY CATEGORY: and replaced it with PRIMARY CATEGORY: Learning – or whatever the category is. Simple and easy.
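The same replacement can be sketched in Python. This assumes the PRIMARY CATEGORY: label sits alone on its own line with nothing after it, and that line endings are plain "\n". The function name is hypothetical.

```python
import re


def add_category(text: str, category: str) -> str:
    """Fill in every empty 'PRIMARY CATEGORY:' line with the given category."""
    # Only lines where the label has no value yet are touched, so running
    # this twice does not produce "PRIMARY CATEGORY: Learning Learning".
    return re.sub(
        r"^PRIMARY CATEGORY:[ \t]*$",
        lambda match: f"PRIMARY CATEGORY: {category}",
        text,
        flags=re.MULTILINE,
    )
```

Since the folder name is the category, the folder name can be passed straight in as the second argument for every file in that folder.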

Remember Those Backups

Again, I had to be very careful with my search and replaces. A little slip screwed things up. I had to make frequent backups as I went along. Let me take a moment to tell you how I did this.

Since everything I’m working on is a copy of the files from my website, and I was working in a test folder on my hard drive, I would simply right click on that folder and save it to a zip or RAR file. I’d make sure that the options were set for preserving folders, and within a few seconds, that folder would be completely backed up for that “just in case” moment that happens more often than I’d care to admit. Since it does happen, this fast backup became routine after every five changes I’d make, or before I was going to try something risky and wanted “to be safe”.

I’d just rename the backup file each time to the date and time, so they would look like this:

site011005-1036.zip
site011005-2150.zip

These represent a backup on January 10, 2005, at 10:36 AM and a second one on the same date at 9:50 PM, using military time. Very simple and easy to identify. Once I’m finally done with all of this and a month has gone by, I will delete all these backups, unless I need the hard drive space sooner. Until then I keep them, again, just in case I find out I really screwed up way back when.
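The right click and rename routine can also be scripted. Here is a minimal Python sketch using only the standard library. The function names and the idea of a separate backup folder are assumptions, not part of what I actually did. It produces zip files named in the same month-day-year, hour-minute pattern as above.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_name(now: datetime) -> str:
    """Build a name like site011005-1036.zip (MMDDYY-HHMM, 24-hour clock)."""
    return f"site{now:%m%d%y-%H%M}.zip"


def backup_folder(folder, dest_dir, now=None) -> Path:
    """Zip the whole folder, subfolders included, into dest_dir."""
    now = now or datetime.now()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # make_archive adds the ".zip" extension itself, so strip it here.
    base = dest / backup_name(now)[: -len(".zip")]
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(folder)))
```

Keep the destination outside the folder being backed up, otherwise each new zip would try to swallow the earlier ones.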

Preparing the Single Import File

Now that I’ve done most of the search and replaces of the easy code, I need to merge all the multiple files into one and work on that in preparation for importing.

There are several freeware and shareware programs out there which merge files together. I chose Fauland’s A.F.7.’s Merge, which is simple and easy to use. In less than 5 seconds it merged 65 html files and allowed me to save the file as html, txt, or whatever. That’s nice.
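If you would rather not install a merge utility, a few lines of Python can stand in for one. This is a sketch only, not a description of how the merge program above works. The function name, the "*.html" pattern, and the UTF-8 encoding are all assumptions you may need to change for your own files.

```python
from pathlib import Path


def merge_folder(folder, out_file, pattern="*.html", encoding="utf-8") -> int:
    """Join every matching file in folder into one file; return the count."""
    # Sorting by name keeps the merged order predictable from run to run.
    pages = sorted(Path(folder).glob(pattern))
    parts = [page.read_text(encoding=encoding).strip() for page in pages]
    # One blank line between pages, one line break at the very end.
    Path(out_file).write_text("\n\n".join(parts) + "\n", encoding=encoding)
    return len(pages)
```

Save the merged file outside the source folder, or with a different extension, so that a second run does not pick up the previous result as one more page.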

Now, the following is a warning to those who don’t know much about computers and software in general. I do not recommend you do what I did, which was to use WordPerfect for all the final preparation of the import file for WordPress. I recommend that you stick with a basic text editor like PSPad. Word processing programs, unless they are tweaked to death to prevent problems, are terrible to use to write software code. If you do go ahead, consider yourself warned.

With all of the 65 documents set in one file, I needed to clean it up and go through and manually edit all the information that was missing or needed moving.

Confronted with a bunch of similar but unique bits of information left on each page, like the meta tags, I needed to come up with a way to quickly remove the information without going through hundreds of physical pages.

Instead of manually removing every one of them, I created a macro in WordPerfect with a repeating loop. Before creating the macro, I did a search and replace for the front element of the information I wanted removed. In one of the example items I needed to remove, this is what it originally looked like:

<meta name="abstract" content="Search Engine Preparation - Designing a web page with search engines in mind.">

I did a search for:

<meta name="abstract" content="

and replaced it with XXXX:. The result in my example became:

XXXX:Search Engine Preparation - Designing a web page with search engines in mind.">

The XXXX: is a unique string of characters that wouldn’t be found within the document, so each line to be deleted now had a specific marker. I repeated this for the rest of the unique elements and tags that would need to be deleted from the document.

Using the powerful macro capabilities in WordPerfect, I recorded a macro that searched for XXXX: and then, from the EDIT menu, chose SELECT > PARAGRAPH and deleted the selection. I did this twice and turned off the macro recorder to make sure I had the process embedded in the macro. I then edited the macro with a loop to repeat. In WordPerfect 12, the macro looks like this (you can copy and paste this into your own macro if you want):

Application (WordPerfect; "WordPerfect"; Default!; "EN")
Label (one)
SearchString (StrgToLookFor: "XXXX:")
SearchNext (SearchMode: Extended!)
SelectParagraph ()
DeleteCharNext ()
SearchString (StrgToLookFor: "XXXX:")
SearchNext (SearchMode: Extended!)
SelectParagraph ()
DeleteCharNext ()
Go (one)

When it reaches the end, because there is no “end-if” statement, it just reports an error of “not found”, for which you click “okay” and it’s done. This isn’t pretty, just quick.

If any of these lines had line breaks, I knew I would still have to manually go through and remove them, but the more removed in the process, the less I have to do manually. This quickly did away with all the little bits and pieces not going into the final import file.
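For those heeding the warning and staying out of WordPerfect, the same idea can be sketched in Python: drop every line that carries the marker. The function name is hypothetical, and this is a stand-in for the macro, not a translation of it. Like the macro, it only removes the line the marker sits on, so anything that wrapped onto a following line is left behind.

```python
def delete_marked_lines(text: str, marker: str = "XXXX:") -> str:
    """Remove every line that contains the marker string."""
    # keepends=True preserves the original line breaks on the lines we keep.
    kept = [line for line in text.splitlines(keepends=True) if marker not in line]
    return "".join(kept)
```

Because it works through the lines once and then stops, there is no “not found” error to click away at the end.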

Manually Inspect Your Import File

It was finally the moment I’d dreaded. It was time to go through and manually inspect the file to see what damage I’d done or hadn’t done. I found some things in my test batch of 65 files that will improve the process for me in the future, but there are still some things that are harder to fix.

I left the big spaces from all the removed meta tags in the document at this point because the huge space was a clue to the break between sections as I scrolled down the huge merged file. But this wasn’t good enough for really “seeing” what was where and where it should and shouldn’t be. With everything black and white on the screen, I was having trouble seeing what was HTML code and what was junk that had to be removed.

Using all the tools at my hand, I copied the content from WordPerfect and pasted it into a blank file in HotDog Pro so that the HTML tags would be colorized, helping me distinguish between what is HTML and what is text. The dividers for the fields come out as text, without tags around them, so they are clearly visible as I scrolled through the document.

Because I chose to use my H2 header as my title, there were line breaks inside of the header which had to be manually removed. I also found bits and pieces of left over H1 titles that also needed to be removed.

On a few of my test pages, I had some CSS style elements in the header, and these needed to either be incorporated into my style sheet, a style sheet of their own, or eliminated. I decided to put them into a style sheet of their own, since they apply to a specific category of articles, and maybe I can still use these when I get ready for the design work. So I cut and pasted them all into a blank css file.

I also have a lot of lists of links to pages within my site set up in a box on each page within a series, so all the series pages are linked together. With PHP, I can bring these in without hard coding, so these had to go. As I stumbled upon these in the file, I could quickly do a search and replace to remove them since they were the same on all of the involved files. Since they are unique to a series of articles, I’d forgotten about them while doing the other major search and replaces, but I could easily catch them now.

You, too, will find elements you can do without, unique to your own design. Make notes so you can fix these things as you move through your different folders.

As I went through the new single file, I put in dates, keeping the WordPress date controls in mind and randomly choosing dates to get the material into some kind of order.

I also had to manually move the EXCERPT from above the TITLE where it sat during the multiple file cleanup, to below the BODY field, keeping things in order.

After removing all the junk, moving the EXCERPT, and setting the dates and categories, I needed to get rid of the empty spaces. A few search and replaces for line breaks (hard returns), double and triple spaces, and tabs later, the file was cleaned up and ready for one last manual inspection before importing the test files into the database.
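That final round of search and replaces can be sketched as one Python function. The function name is made up, and it makes an assumption you should check against your own pages: that runs of spaces and tabs never matter, which is not true inside pre tags or anywhere else the spacing is deliberate.

```python
import re


def tidy_whitespace(text: str) -> str:
    """Strip trailing blanks, squeeze space/tab runs, collapse blank lines."""
    # 1. Remove spaces and tabs left dangling at the end of each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # 2. Turn any run of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # 3. Reduce three or more line breaks to two (one blank line).
    return re.sub(r"\n{3,}", "\n\n", text)
```

Note that step 2 also shrinks any indentation at the start of a line down to a single space.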

Prepare to Engage Warp Engines

I was very proud of myself and all my hard work, which in the end didn’t really take that long, once I got the hang of it. The macro I created with WordPerfect really cleaned things up and saved a lot of time. Too bad I can’t use that across multiple files! My last manual check of everything caught a few more little buggers, but they were quickly cleaned up. All the records (pages) were divided with the 8-dash lines and the fields cleanly separated by five-dash lines…I was ready to go.

I saved the file to a text file and uploaded it to my wordpress/wp-admin/ folder on my website where I’d installed my test version of WordPress. I then typed in the URL in my browser of the import-mt.php which resides in the wordpress/wp-admin/ folder and started praying.

Now, I have to be honest with you. You come here because you expect honesty from Lorelle. She won’t lie to you. She’ll exaggerate and say something she believes to be true, until proven otherwise, and then freely admit that she screwed up. So you will get the truth here.

Yes, I screwed up and had to try this import-mt business several times before I figured out that 1) the order of the fields in the import file mattered, 2) a stray bit of code left in will cut your import short very quickly, and 3) everything you’ve read up until this point includes the fixes I found AFTER I reached this point. So you are getting the truth about how I did this, but the edited version. I didn’t want anyone following along and then emailing me that I was some kind of idiot because it didn’t work. I discovered it didn’t work and had to fix it, so you are getting the benefit of my aches and pains, trials and errors.

But don’t worry. The next entry covers my trials and tribulations, so maybe someone can learn from them.

So, once I’d done it “right”, my test web pages were now imported into the database and WordPress could find them, and now it was time to really see what had gone in, and what had not, and to start to get to work on designing the website so all of this work would look pretty when finally finished.
