My program does everything I want except save the final data to the CSV file. I added a print just before the write to check the data, and it is correct; it just isn't being written to the file. I'm opening the file with 'a' because I don't want to overwrite what's already there, but it still raises an error.

Here's the relevant part of the code:

soup = BeautifulSoup(answer)
for table in soup.findAll('table', {"class":"formTable"}):
    for row in table.findAll('tr'):
        #heading = row.find('td', {"class":"sectionHeading"})
        #if heading is not None:
            #print(heading.get_text());
        #else:
        label = row.find('td', {"class":"fieldLabel"})
        data = row.find('td', {"class":"fieldData"})
        if data is not None and label is not None:
            csvline += label.get_text() + "," + data.get_text() + ","
print(csvline)
#csvline.encode('utf-8')
with open ('output_file_two.csv', 'a', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(csvline)

Here's the error:

Traceback (most recent call last):
  File "C:\PROJECT\pdfs\final.py", line 95, in <module>
    with open ('output_file_two.csv', 'a', encoding='utf-8') as f:
TypeError: 'encoding' is an invalid keyword argument for this function

Here's the entire program, in case it's needed:

import shlex
import subprocess
import os
import platform
from bs4 import BeautifulSoup
import re
#import unicodecsv as csv
import csv
#import pickle
import requests
from robobrowser import RoboBrowser
import codecs

def rename_files():
    file_list = os.listdir(r"C:\\PROJECT\\pdfs")
    print(file_list)
    saved_path = os.getcwd()
    print('Current working directory is '+saved_path)
    os.chdir(r'C:\\PROJECT\\pdfs')
    for file_name in file_list:
        os.rename(file_name, file_name.translate(None, " "))
    os.chdir(saved_path)
rename_files()

def run(command):
    if platform.system() != 'Windows':
        args = shlex.split(command)
    else:
        args = command
    s = subprocess.Popen(args,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    output, errors = s.communicate()
    return s.returncode == 0, output, errors

# Change this to your PDF file base directory
base_directory = 'C:\\PROJECT\\pdfs'
if not os.path.isdir(base_directory):
    print "%s is not a directory" % base_directory
    exit(1)
# Change this to your pdf2htmlEX executable location
bin_path = 'C:\\Python27\\pdfminer-20140328\\tools\\pdf2txt.py'
if not os.path.isfile(bin_path):
    print "Could not find %s" % bin_path
    exit(1)
for dir_path, dir_name_list, file_name_list in os.walk(base_directory):
    for file_name in file_name_list:
        # If this is not a PDF file
        if not file_name.endswith('.pdf'):
            # Skip it
            continue
        file_path = os.path.join(dir_path, file_name)
        # Convert your PDF to HTML here
        args = (bin_path, file_name, file_path)
        success, output, errors = run("python %s -o %s.html %s " %args)
        if not success:
            print "Could not convert %s to HTML" % file_path
            print "%s" % errors
htmls_path = 'C:\\PROJECT'
with open ('score.csv', 'w') as f:
    writer = csv.writer(f)
    for dir_path, dir_name_list, file_name_list in os.walk(htmls_path):
        for file_name in file_name_list:
            if not file_name.endswith('.html'):
                continue
            with open(file_name) as markup:
                soup = BeautifulSoup(markup.read())
                text = soup.get_text()
                match = re.findall("PA/(\S*)", text)#To remove the names that appear, just remove the last (\S*), to add them is just add the (\S*), before it there was a \s*
                print(match)
                writer.writerow(match)
                for item in match:
                    data = item.split('/')
                    case_number = data[0]
                    case_year = data[1]
                    csvline = case_number + ","

                    browser = RoboBrowser()
                    browser.open('http://www.pa.org.mt/page.aspx?n=63C70E73&CaseType=PA')
                    form = browser.get_forms()[0]  # Get the first form on the page
                    form['ctl00$PageContent$ContentControl$ctl00$txtCaseNo'].value = case_number
                    form['ctl00$PageContent$ContentControl$ctl00$txtCaseYear'].value = case_year

                    browser.submit_form(form, submit=form['ctl00$PageContent$ContentControl$ctl00$btnSubmit'])

                    # Use BeautifulSoup to parse this data
                    answer = browser.response.text
                    #print(answer)
                    soup = BeautifulSoup(answer)
                    for table in soup.findAll('table', {"class":"formTable"}):
                        for row in table.findAll('tr'):
                            #heading = row.find('td', {"class":"sectionHeading"})
                            #if heading is not None:
                                #print(heading.get_text());
                            #else:
                             label = row.find('td', {"class":"fieldLabel"})
                             data = row.find('td', {"class":"fieldData"})
                             if data is not None and label is not None:
                                        csvline += label.get_text() + "," + data.get_text() + ","
                    print(csvline)
                    with open ('output_file_two.csv', 'a') as f:
                        writer = csv.writer(f)
                        writer.writerow(csvline)

EDIT

It's working now. Here's the working code:

import shlex
import subprocess
import os
import platform
from bs4 import BeautifulSoup
import re
import unicodecsv as csv
import requests
from robobrowser import RoboBrowser
import codecs

def rename_files():
    file_list = os.listdir(r"C:\\PROJECT\\pdfs")
    print(file_list)
    saved_path = os.getcwd()
    print('Current working directory is '+saved_path)
    os.chdir(r'C:\\PROJECT\\pdfs')
    for file_name in file_list:
        os.rename(file_name, file_name.translate(None, " "))
    os.chdir(saved_path)
rename_files()

def run(command):
    if platform.system() != 'Windows':
        args = shlex.split(command)
    else:
        args = command
    s = subprocess.Popen(args,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    output, errors = s.communicate()
    return s.returncode == 0, output, errors


base_directory = 'C:\\PROJECT\\pdfs'
if not os.path.isdir(base_directory):
    print "%s is not a directory" % base_directory
    exit(1)

bin_path = 'C:\\Python27\\pdfminer-20140328\\tools\\pdf2txt.py'
if not os.path.isfile(bin_path):
    print "Could not find %s" % bin_path
    exit(1)
for dir_path, dir_name_list, file_name_list in os.walk(base_directory):
    for file_name in file_name_list:

        if not file_name.endswith('.pdf'):

            continue
        file_path = os.path.join(dir_path, file_name)

        args = (bin_path, file_name, file_path)
        success, output, errors = run("python %s -o %s.html %s " %args)
        if not success:
            print "Could not convert %s to HTML" % file_path
            print "%s" % errors
htmls_path = 'C:\\PROJECT'
with open ('score.csv', 'w') as f:
    writer = csv.writer(f)
    for dir_path, dir_name_list, file_name_list in os.walk(htmls_path):
        for file_name in file_name_list:
            if not file_name.endswith('.html'):
                continue
            with open(file_name) as markup:
                soup = BeautifulSoup(markup.read())
                text = soup.get_text()
                match = re.findall("PA/(\S*)", text)
                print(match)
                writer.writerow(match)
                for item in match:
                    data = item.split('/')
                    case_number = data[0]
                    case_year = data[1]
                    csvline = case_number + ","

                    browser = RoboBrowser()
                    browser.open('http://www.pa.org.mt/page.aspx?n=63C70E73&CaseType=PA')
                    form = browser.get_forms()[0]  
                    form['ctl00$PageContent$ContentControl$ctl00$txtCaseNo'].value = case_number
                    form['ctl00$PageContent$ContentControl$ctl00$txtCaseYear'].value = case_year

                    browser.submit_form(form, submit=form['ctl00$PageContent$ContentControl$ctl00$btnSubmit'])


                    answer = browser.response.text
                    soup = BeautifulSoup(answer)
                    for table in soup.findAll('table', {"class":"formTable"}):
                        for row in table.findAll('tr'):
                             label = row.find('td', {"class":"fieldLabel"})
                             data = row.find('td', {"class":"fieldData"})
                             if data is not None and label is not None:
                                csvline += label.get_text() + "," + data.get_text() + ","
                                print(csvline)
                                my_file = codecs.open('final_output.csv', 'a', 'utf-8')
                                my_file.write(csvline)
fsgdfgsd

2 Answers

There is a problem at the end of your code:

writer = csv.writer(f)
csv.writer(csvline) # here is the problem

You initialize the writer, but then you never use it. It should be:

writer = csv.writer(f)
writer.writerow(csvline)
VKolev
  • Now it is raising another error: Traceback (most recent call last): File "C:\PROJECT\pdfs\final.py", line 103, in <module> writer.writerow(csvline) UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128) – fsgdfgsd May 05 '17 at 08:18
  • You should look at how to encode your data as UTF-8 with `csvline.encode('utf-8')`, or open the file with UTF-8 encoding: `with open('....csv', 'w', encoding='utf-8') as f` – VKolev May 05 '17 at 08:21
  • Which Python version are you using? I assumed you were on Python 3.x. If you are on Python 2.x you need a different solution, but that has little to do with writing to a CSV file. It is an encoding problem, and there are plenty of answers about it here on Stack Overflow. See here: http://stackoverflow.com/questions/18766955/how-to-write-utf-8-in-a-csv-file – VKolev May 05 '17 at 08:35
  • Yeah, I'm using Python 2.7 – fsgdfgsd May 05 '17 at 08:40
  • Well, that's why you get the `TypeError` when adding `encoding='utf-8'` to the `open` function: the built-in `open` in Python 2 does not accept that argument. See the link in my previous comment. – VKolev May 05 '17 at 08:42
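
For reference, the point in the comments above can be sketched as follows. `io.open` accepts an `encoding` argument on Python 2.6+ as well as on Python 3 (where it is the same function as the built-in `open`), so it avoids the `TypeError`. The helper name `append_text`, the sample string, and the temporary file path are made up for illustration; on Python 2 the text passed in has to be a `unicode` object, which is what BeautifulSoup's `get_text()` returns.

```python
import io
import os
import tempfile


def append_text(path, text):
    """Append unicode text to a file, encoding it as UTF-8 on the way out."""
    # io.open takes encoding= on both Python 2.6+ and Python 3.
    with io.open(path, 'a', encoding='utf-8') as f:
        f.write(text)


# Hypothetical sample containing the non-breaking space (u'\xa0')
# that triggered the UnicodeEncodeError in the comment thread.
sample = u"\xa0Case Number,01234\n"

demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_text(demo_path, sample)
```

This only covers writing pre-built text. Python 2's `csv` module itself does not handle `unicode` input, which is why the linked question and the asker's edit use other approaches.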

Here:

with open ('output_file_two.csv', 'a') as f:
    writer = csv.writer(f)
    csv.writer (csvline)

You are instantiating a csv.writer, but not using it. This should read:

with open ('output_file_two.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(csvline)

Now there are quite a few other problems with your code, the first one being that you build csvline manually as text and then use csv.writer to store it to the file. A csv writer expects each row as a sequence of fields (a list or tuple) and takes care of properly escaping what needs to be escaped, inserting the proper delimiters etc.; if you hand writerow() a plain string, every character is treated as a separate field. writerows() takes a list of rows, while writerow() takes a single row and so avoids building the whole list in memory, FWIW.
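
A minimal sketch of that idea, written for Python 3 (on Python 2.7 the built-in `open` has no `encoding` argument, so you would need something like `unicodecsv` instead). The `append_row` helper, the sample label/value pairs, and the temporary output path are placeholders for illustration, not taken from your program:

```python
import csv
import os
import tempfile


def append_row(path, fields):
    """Append one row; csv.writer handles quoting and delimiters."""
    # newline='' lets the csv module control line endings (Python 3).
    with open(path, 'a', encoding='utf-8', newline='') as f:
        csv.writer(f).writerow(fields)


# Hypothetical label/value pairs standing in for the scraped table cells.
pairs = [("Case Number", "01234"), ("Location", "Triq il-Kbira, Valletta")]

# Build the row as a list of fields instead of a comma-joined string.
row = ["01234"]
for label, value in pairs:
    row.extend([label, value])

out_path = os.path.join(tempfile.mkdtemp(), "output_file_two.csv")
append_row(out_path, row)
```

Because the row is a list, a value that itself contains a comma (like the location above) is quoted by the writer instead of silently splitting into two columns.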

bruno desthuilliers
  • But when I tried to do as you wrote, it still raised an error about the encoding. I tried to encode as UTF-8 and it raises another error – fsgdfgsd May 05 '17 at 08:32
  • @Samuel that's another problem. You (obviously) have to first encode the text parts of your rows to the desired encoding. If you have problems with this, please first learn about Unicode and encodings (required knowledge nowadays), then post another question (with all relevant details) if you still get errors. – bruno desthuilliers May 05 '17 at 11:39