Batch Convert HTML Files to CSV
This guide shows how to convert multiple HTML files, each containing tabular data about a different species of fungi, into CSV files. Instead of processing each file individually, we'll automate the task with a Python script.
Prerequisites
- Python 3 installed on your machine
- BeautifulSoup library for HTML parsing (install via pip install beautifulsoup4)
Step-by-Step Guide
Set Up Your Directory: Ensure all your HTML files are located in a single directory. This example uses ../html/species.
Python Script: Below is a Python script that reads each HTML file, extracts the table data, and saves it as a CSV file.
import csv
import os

from bs4 import BeautifulSoup

# Directory containing the HTML files
html_directory = r'../html/species'

# Loop through each file in the directory
for filename in os.listdir(html_directory):
    if not filename.endswith('.html'):
        continue

    # Construct the full file path
    file_path = os.path.join(html_directory, filename)
    # Assumption: the HTML files are UTF-8 encoded; adjust if yours are not
    with open(file_path, 'r', encoding='utf-8') as file:
        # Parse the HTML content
        soup = BeautifulSoup(file.read(), 'html.parser')

    # Find the first table in the HTML; skip files that have none
    table = soup.find('table')
    if table is None:
        print(f'No table found in {filename}, skipping')
        continue
    table_data = table.find_all('tr')[1:]  # Skip header row

    # The CSV file is written to the current working directory
    csv_filename = os.path.splitext(filename)[0] + '.csv'
    with open(csv_filename, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        # Extract the text of each data cell, row by row
        for row in table_data:
            row_data = [cell.get_text(strip=True) for cell in row.find_all('td')]
            writer.writerow(row_data)  # Write row to CSV
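The script above drops the first table row and reads only td cells, so the column names never reach the CSV. If you want them, one possible variant is sketched below. The function name table_to_rows and the sample HTML are made up for illustration; the sketch assumes header cells are marked up as th.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_to_rows(html):
    """Return the first table in `html` as a list of rows (lists of cell text)."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return []  # No table in this document
    rows = []
    for tr in table.find_all('tr'):
        # Accept header cells (th) as well as data cells (td)
        cells = tr.find_all(['th', 'td'])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


# Made-up sample input, for illustration only
sample = """
<table>
  <tr><th>Species</th><th>Cap colour</th></tr>
  <tr><td>Amanita muscaria</td><td>red</td></tr>
</table>
"""

# Write the rows to an in-memory buffer instead of a file
buffer = io.StringIO()
csv.writer(buffer).writerows(table_to_rows(sample))
print(buffer.getvalue())
```

Because the header row is kept as the first element of the returned list, it becomes the first line of the CSV output. Note that find_all('tr') also picks up rows of any nested tables, which may or may not be what you want.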
Explanation of the Script
- Directory Setup: The script begins by defining the directory where the HTML files are stored.
- File Loop: It iterates through each file in the directory, checking if the file ends with .html.
- HTML Parsing: For each HTML file, it reads the content and uses BeautifulSoup to parse the HTML and find the first table.
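- CSV Writing: It creates a CSV file for each HTML file, named after the source file and saved in the current working directory. The first table row is skipped as a header, and the text of the td cells in each remaining row is written as one CSV row.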
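The file loop and output naming described above can also be sketched with pathlib, which makes it easy to send the CSV files to a separate folder. This is only an illustration: the function name plan_conversions and its out_dir parameter are hypothetical and not part of the script, which writes to the current working directory.

```python
import tempfile
from pathlib import Path


def plan_conversions(html_dir, out_dir):
    """Pair every .html file in html_dir with a .csv path inside out_dir."""
    html_dir = Path(html_dir)
    out_dir = Path(out_dir)
    pairs = []
    # sorted() gives a stable order; glob() alone makes no ordering promise
    for src in sorted(html_dir.glob('*.html')):
        # with_suffix() swaps only the final extension of the file name
        pairs.append((src, out_dir / src.with_suffix('.csv').name))
    return pairs


# Demonstration on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    species_dir = Path(tmp) / 'species'
    species_dir.mkdir()
    (species_dir / 'agaricus.html').write_text('<table></table>', encoding='utf-8')
    (species_dir / 'notes.txt').write_text('ignored', encoding='utf-8')
    for src, dst in plan_conversions(species_dir, Path(tmp) / 'csv'):
        print(src.name, '->', dst.name)
```

The function only computes paths; it does not create out_dir or write any files, so the conversion step itself would still need to create the output folder before opening the CSV files.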
Conclusion
With this script, you can convert a whole directory of HTML files into CSV format in one run instead of processing each file by hand. The time saved grows with the number of files you need to convert.