
I have a parent directory with subdirectories, each of which contains an .html file that I want to run my code on. The code takes an HTML file and exports a corresponding CSV file with the table data.

I have tried two main approaches, but neither works, because the script cannot find the .html file (it is reported as non-existent). Note: the file in each subdirectory is always named index.html.

Linux command line (using Code 1):

for file in */; do for file in *.html; do python html_csv2.py "$file"; done; done
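
(The inner loop above re-binds $file and never actually enters each subdirectory, so the glob finds nothing. For reference, a Python sketch of what the loop is presumably meant to do; html_csv2.py is the script named in the loop:)

import pathlib
import subprocess

script = pathlib.Path('html_csv2.py').resolve()  # assumed to live in the parent directory
for sub in pathlib.Path('.').iterdir():
    # Run the script inside each subdirectory that contains an index.html,
    # so the hardcoded 'index.html' in the script resolves correctly
    if sub.is_dir() and (sub / 'index.html').is_file():
        subprocess.run(['python', str(script), 'index.html'], cwd=sub)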

Code 1:

import pandas as pd
from bs4 import BeautifulSoup as bs
from simplified_scrapy import SimplifiedDoc, utils

name = 'index.html'
html = utils.getFileContent(name)
# Get data from file
doc = SimplifiedDoc(html)
soup = bs(html, 'lxml')

title = soup.select_one('title').text
title = title.split(' -')
strain = title[0]
rows = []
tables = doc.selects('table.region-table')
tables = tables[:-1]
#print (type(tables))
for table in tables:
    trs = table.tbody.trs
    for tr in trs:
        rows.append([td.text for td in tr.tds])
#print(rows)
#print(type(rows))
#print("PANDAS DATAFRAME")
df_rows = pd.DataFrame(rows)
df_rows.columns = ['Region', 'Class', 'From', 'To', 'Associated Product', 'Class', 'Similarity']
df_rows['Strain'] = strain
df_rows = df_rows[['Strain','Region', 'Class', 'From', 'To', 'Associated Product', 'Class', 'Similarity']] 
#print(df_rows)
df_rows.to_csv('antismash_html.csv', index=False, header=True)
print('CSV CREATED')
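
Note that the shell loop passes the file as an argument, while Code 1 hardcodes name = 'index.html'. A minimal sketch of reading the name from the command line instead (the default fallback is an assumption):

import sys

# Assumption: take the input path from argv, defaulting to index.html
name = sys.argv[1] if len(sys.argv) > 1 else 'index.html'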

In this second snippet, I'm trying to use the os library to go into each subdirectory.

Code 2:

import csv
from simplified_scrapy import SimplifiedDoc,req,utils
import sys
import pandas as pd
import lxml.html
from bs4 import BeautifulSoup as bs
import os

name = 'index.html'
html = utils.getFileContent(name)  # note: this reads index.html once, from the starting directory
# Get data from file
doc = SimplifiedDoc(html)
soup = bs(html, 'lxml')

cwd = os.getcwd()
print(cwd)
directory_to_check = cwd # Which directory do you want to start with?

def directory_function(directory):
    print("Listing: " + directory)
    print("\t-" + "\n\t-".join(os.listdir(".")))  # List the current working directory

# Get all the subdirectories of directory_to_check recursively and store them in a list:
directories = [os.path.abspath(x[0]) for x in os.walk(directory_to_check)]
directories.remove(os.path.abspath(directory_to_check))  # Don't want it run in my main directory

def csv_create(name):
    # note: this uses the module-level soup and doc, so the name argument is unused
    title = soup.select_one('title').text
    title = title.split(' -')
    strain = title[0]
    rows = []
    tables = doc.selects('table.region-table')
    tables = tables[:-1]
    #print (type(tables))
    for table in tables:
        trs = table.tbody.trs
        for tr in trs:
            rows.append([td.text for td in tr.tds])
    #print(rows)
    #print(type(rows))
    #print("PANDAS DATAFRAME")
    df_rows = pd.DataFrame(rows)
    df_rows.columns = ['Region', 'Class', 'From', 'To', 'Associated Product', 'Class', 'Similarity']
    df_rows['Strain'] = strain
    df_rows = df_rows[['Strain','Region', 'Class', 'From', 'To', 'Associated Product', 'Class', 'Similarity']] 
    #print(df_rows)
    df_rows.to_csv('antismash_html.csv', index=False, header=True)
    print('CSV CREATED')
    #with open(name +'.csv','w',encoding='utf-8') as f:
    #    csv_writer = csv.writer(f)
    #    csv_writer.writerows(rows)

for i in directories:
    os.chdir(i)       # Change working directory
    csv_create(name)  # Run the function


directory_function  # note: missing parentheses, so this line never actually calls the function
#csv_create(name)

I tried using the example here: Python: run script in all subdirectories, but was not able to get it to work.

2 Answers


Alternatively, you could consider using glob.glob(). But be careful to search from the folder you intend to, either by specifying the path in the glob expression or by cd'ing into that folder first.

glob will give you a flat list of relative paths.

>>> import glob
>>> 
>>> files = glob.glob('**/*.py', recursive=True)
>>> len(files)
3177
>>> files[0]
'_wxWidgets-3.0.2/build/bakefiles/wxwin.py'
>>> 

Doc is here with some glob expression examples: https://docs.python.org/3.5/library/glob.html

If you start glob off on a recursive search on your drive from a folder that has a lot of nested subfolders, it'll bog the interpreter down until it completes - or you kill the session.
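
Applied to the layout in the question (one index.html per subdirectory), a minimal sketch might look like this; csv_create stands in for the question's table-extraction code, and its two-argument signature is an assumption:

import glob
import os

for path in glob.glob('**/index.html', recursive=True):
    # Derive the output path so each csv lands next to its own index.html,
    # rather than overwriting a single file in the working directory
    out = os.path.join(os.path.dirname(path), 'antismash_html.csv')
    csv_create(path, out)  # hypothetical signature: (input html path, output csv path)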

Todd
  • Thanks for the follow-up. Would it be possible to specify just .html files? Or, since my input file always has a fixed name (`name = 'index.html'`), can I just incorporate this glob command? – bioinformatics_student Mar 06 '20 at 11:32
  • Yeah, if I'm understanding you correctly. If you want relative paths to all html files under a certain folder, `'**/*.html'` would be a basic glob pattern for that. Or `/users/me/projects/web/**/*.html`. – Todd Mar 06 '20 at 11:39
  • It returns a list of my subdirectories with the respective file, `['index3/index.html', 'index1/index.html', 'index2/index.html']`, but then _name_ is not recognized as a file itself, giving the error `TypeError: expected str, bytes or os.PathLike object, not list` – bioinformatics_student Mar 09 '20 at 09:58
  • @Biohacker the error indicates that you're passing a list where a single string (or path-like object) is expected. Check your code to make certain you're passing a string. – Todd Mar 09 '20 at 10:02
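
A two-line sketch of the point above: glob returns a list, so its elements have to be passed one at a time, not the whole list at once.

files = glob.glob('**/index.html', recursive=True)
for f in files:                     # each f is a single path string
    html = utils.getFileContent(f)  # passing the list itself raises the TypeError above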

Try this.

import os
from simplified_scrapy import utils

def getSubDir(name, end=None):
    filelist = os.listdir(name)
    if end:
        # Keep only the entries whose names end with the given suffix
        filelist = [os.path.join(name, l) for l in filelist if l.endswith(end)]
    return filelist

subDir = getSubDir('./')  # The directory you want to start with
for dir in subDir:
    # files = getSubDir(dir, end='index.html')
    fileName = dir + '/index.html'
    if not os.path.isfile(fileName): continue
    html = utils.getFileContent(fileName)
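
To get one csv per subdirectory (the follow-up raised in the comments below), here is a sketch that builds on getSubDir above; it assumes the question's csv_create is refactored to take the html text and an output path:

for dir in getSubDir('./'):
    fileName = dir + '/index.html'
    if not os.path.isfile(fileName):
        continue
    html = utils.getFileContent(fileName)
    # Write each csv beside its own index.html so runs don't overwrite each other
    csv_create(html, os.path.join(dir, 'antismash_html.csv'))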
dabingsou
  • Thank you very much; you assisted me before with the html extraction. When I use this code it only works for one specific file. I tried putting the _csv_create_ function both inside and outside the for loop, but it only creates the csv file for the last directory in the subdirectory list, not for each respective subdirectory. `['html_csv.py', 'index3', 'html_csv2.py', 'antismash_html.csv', 'html2csv.py', 'index1', 'index2']` – bioinformatics_student Mar 09 '20 at 10:49
  • @Biohacker I updated the answer. See if it's what you want. – dabingsou Mar 09 '20 at 11:05
  • I think I did not clarify properly: as originally mentioned, I just want one main csv file containing all the tables from the original html file. In other words, each subdirectory should end up with the new csv file (all tables in the same csv) next to its respective html. Basically, each subdirectory will have one html and one csv file. – bioinformatics_student Mar 09 '20 at 11:27
  • @Biohacker Is there any problem with this? Paste your directory structure and I'll help you change the code. – dabingsou Mar 09 '20 at 14:03
  • For some reason it gives me errors: there are subdirectories within my subdirectories, so it keeps searching for the index.html in those even after it has already been found. I just want it to go into my first subdirectory, back to the parent directory, and so on. Here is my code: pastebin.com/M44iMGbL – bioinformatics_student Mar 09 '20 at 15:51
  • The structure more or less looks like this: _**Parent-Directory**: **Sub1**: index.html; Subsub1: other files; **Sub2**: index.html; Subsub2: other files; **Subn**: index.html; Subsubn: other files_ – bioinformatics_student Mar 09 '20 at 15:53
  • @Biohacker The above code only takes one level of subdirectory, and there is also a guard condition `if not os.path.isfile(fileName): continue`. Did I neglect anything? – dabingsou Mar 10 '20 at 01:04
  • I don't know if it's the way I'm using my functions: I have used your code above and it does go into every subdirectory, but it only creates one csv file rather than one per subdirectory (it keeps overwriting the csv file). Here is the code I am using: https://pastebin.com/dShqa8iH – bioinformatics_student Mar 10 '20 at 09:51
  • I was able to correct this after all, thanks once again for your help :) – bioinformatics_student Mar 10 '20 at 11:26