Converting Word documents to HTML using C#
If you have a stack of MS Word docs and you want them all converted to
HTML, why not create a .NET application and do the conversion automatically?
The actual reason for creating this piece of code was that I wanted my ASP.NET
2 website to let people upload Word docs and then display them as web
pages, converting them automatically to HTML and making them available for everyone
to read, whether they have Word or not.
It turns out that this is pretty simple, because MS Word has a "save as
html" feature that you can use. Now, you all hate that feature because
it creates bad HTML, but in fact you can help it to do a better job, so read
on...
What I did was make sure that Word was installed on the webserver, then link
to an instance of Word using COM objects, upload the doc and open it in Word,
then call the "save as html" feature to output the HTML. Here is how:
- Make sure that Word is installed on the webserver, or else ASP.NET can't
call Word's functions, obviously. It also helps if you run Word and try it
out on the server first, or else you run into the "the very first time
that Word runs it does something weird" scenario. In fact it needs to
ask you for your initials on the first run, I don't know why.
- Add a reference to MS Word in your project. This is done from Solution Explorer:
right click and select "add reference", then on the COM tab find
your Word library. Mine is called Microsoft Word 11 Object Library, but different
versions of Word will have a different version number.
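- By this point the Word namespace will be available to you, normally as
Microsoft.Office.Interop.Word when the Office primary interop assemblies are
installed. The code below refers to it simply as Word, for example through
using Word = Microsoft.Office.Interop.Word;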
- Now make a project with a FileUpload control and a Button to start the conversion
process. Here is the code to get the file uploaded from the user's browser
into a temp directory on the server:
if (FileUpload1.HasFile)
{
// Upload the file to a temporary folder...
string folder_to_save_in = @"c:\temp\docs\";
string filePath = folder_to_save_in + FileUpload1.FileName;
FileUpload1.SaveAs(filePath);
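One thing worth adding here: FileUpload1.FileName is whatever the browser sent, so it is a good idea to check that it really is a .doc before saving it and handing it to Word. Below is a minimal sketch of such a check. UploadHelper and TryGetSafeUploadPath are names I made up for illustration, and the rule "only .doc is allowed" is an assumption you may want to loosen.

```csharp
using System;
using System.IO;

public static class UploadHelper
{
    // Hypothetical helper: returns true only when clientFileName is a
    // .doc file name; on success fullPath points inside folder.
    public static bool TryGetSafeUploadPath(string folder, string clientFileName, out string fullPath)
    {
        fullPath = null;
        if (string.IsNullOrEmpty(clientFileName))
            return false;

        // Keep only the file-name part, in case a directory was sent along.
        string name = Path.GetFileName(clientFileName);
        if (string.IsNullOrEmpty(name))
            return false;

        // Accept .doc in any letter case, reject everything else.
        string ext = Path.GetExtension(name);
        if (!string.Equals(ext, ".doc", StringComparison.OrdinalIgnoreCase))
            return false;

        fullPath = Path.Combine(folder, name);
        return true;
    }
}
```

In the upload code you would call this with folder_to_save_in and FileUpload1.FileName, and only call FileUpload1.SaveAs when it returns true.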
Now we create a Word COM object. The Open command requires lots of parameters,
but almost all of them can be left blank. You must still pass
System.Reflection.Missing.Value for each one, by ref, or the compiler complains at you:
Word.ApplicationClass wordApplication = new Word.ApplicationClass();
// Set up an object to hold a missing value...
object o_nullobject = System.Reflection.Missing.Value;
object o_filePath = filePath;
Word.Document doc = wordApplication.Documents.Open(ref o_filePath,
    ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject,
    ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject,
    ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject);
Now save it in HTML format by passing the appropriate format value to the SaveAs command:
// ChangeExtension swaps only the final extension, whatever its letter case
string newfilename = System.IO.Path.ChangeExtension(filePath, ".html");
object o_newfilename = newfilename;
object o_format = Word.WdSaveFormat.wdFormatHTML;
object o_encoding = Microsoft.Office.Core.MsoEncoding.msoEncodingUTF8;
object o_endings = Word.WdLineEndingType.wdCRLF;
// SaveAs requires lots of parameters, but we can leave most of them empty:
// Save through the document we opened rather than relying on ActiveDocument
doc.SaveAs(ref o_newfilename, ref o_format, ref o_nullobject,
    ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_nullobject,
    ref o_nullobject, ref o_nullobject, ref o_nullobject, ref o_encoding, ref o_nullobject,
    ref o_nullobject, ref o_endings, ref o_nullobject);
In this example I set it up to encode into UTF-8 and use Carriage Return Line
Feed line endings. Apart from that, there are some other options that you
can find here:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd11/html/womthSaveAs1_HV05213080.asp
Finally we close the Word doc, then the application, and then release the
COM object:
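// Reuse the same "missing value" object that was declared earlier
doc.Close(ref o_nullobject, ref o_nullobject, ref o_nullobject);
wordApplication.Quit(ref o_nullobject, ref o_nullobject, ref o_nullobject);
System.Runtime.InteropServices.Marshal.ReleaseComObject(wordApplication);
} // end of if (FileUpload1.HasFile)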
Now the other issue is to go into the HTML file that has been created and delete
a whole load of stupid codes that MS Word puts in there for no reason at all,
but I'll leave that bit up to you...
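To give you a starting point for that cleanup, here is a rough sketch using regular expressions. It only targets three things that Word's HTML commonly contains: conditional comments, Office namespace tags such as <o:p>, and class attributes that start with Mso. WordHtmlCleaner and Clean are hypothetical names, the patterns are my assumptions rather than a complete list, and regexes are a blunt tool for HTML, so treat this as a first pass and not a real parser.

```csharp
using System.Text.RegularExpressions;

public static class WordHtmlCleaner
{
    // Hypothetical helper: strips a few kinds of Word-specific markup.
    public static string Clean(string html)
    {
        // Conditional comments: <!--[if gte mso 9]> ... <![endif]-->
        html = Regex.Replace(html, @"<!--\[if[\s\S]*?<!\[endif\]-->", "",
            RegexOptions.IgnoreCase);

        // Office namespace tags such as <o:p> and </o:p> (inner text is kept)
        html = Regex.Replace(html, @"</?[ovwm]:[^>]*>", "",
            RegexOptions.IgnoreCase);

        // class attributes naming Mso* styles: quoted or unquoted
        html = Regex.Replace(html,
            @"\s+class=(""Mso[^""]*""|'Mso[^']*'|Mso[^\s>]*)", "",
            RegexOptions.IgnoreCase);

        return html;
    }
}
```

You could read the saved file with System.IO.File.ReadAllText, pass the text through Clean, and write it back with System.IO.File.WriteAllText. Inline style attributes containing mso- properties are left alone here; those need more care.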