
Each input line looks like this:

-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf

Required output is:

25399 Nov  2 21:25 exception_hierarchy.pdf

which is size, month, day, hour, minute and filename respectively.

The task is to return a list of tuples (size, month, day, hour, minute, filename), using regular expressions (one of the match, search, findall, or finditer methods).

The code I tried:

for line in range(1):
    line = f.readline()
    x = re.findall(r'[^-]\d+\w+:\w+.*\w+_*', line)
    print(x)

My output: [' 21:25 add_colab_link.py']
SNEHA
  • This appears to be a homework question. If so, Stackoverflow has a rule that the person asking the question must demonstrate that they have attempted to solve the problem themselves. – Cary Swoveland Mar 18 '20 at 18:18
  • I tried doing it with first line only initially also Ignore the multiple prints as I was trying to see where I am heading-f=open("C:\Sneha_Python\hy-data-analysis-with-python-spring-2020\part02-e02_file_listing\src\listing.txt","r") for line in range(1): line=f.readline() print(line[1]) x=re.findall(r'[^-]\d+\w+:\w+.*\w+_*',line) regex=re.compile('[a-z][0-9]',re.IGNORECASE) y=regex.findall(line) print(y) print (x) I got output as- [' 11:50 add_colab_link.py'] Not the entire line, I am a beginner. – SNEHA Mar 18 '20 at 18:35
  • Please modify your question instead, as code doesn't format properly in comments as you can see. – arnaud Mar 18 '20 at 18:36
  • Sorry but this is the format of the line. There are multiple lines like this. Each line has seven fields. To make it clear I will add a comma after each word. But that is not the original format though. -rw-r--r-- 1, jttoivon, hyad-all, 25399, Nov 2, 21:25, exception_hierarchy.pdf – SNEHA Mar 18 '20 at 18:39
  • can you edit your question and format code as code? – Arco Bast Mar 18 '20 at 18:51
  • I understood what's your input, and also what's your output. Please edit your question to include what you have tried, using proper code formatting (this is unreadable as a comment) – arnaud Mar 18 '20 at 18:51

2 Answers


Please have a read of the following example on how to ask great questions: How to make a great R reproducible example

I am answering your question because not long ago I made the same mistakes, and I was glad when someone still answered.

import re  # import of regular expression library

# I just assume you had three of those pieces in one list:
my_list = ["-rw-r--r-- 1 jttoivon hyad-all 12345 Nov 2 21:25 exception_hierarchy.pdf", "-rw-r--r-- 1 jttoivon hyad-all 25399 Nov 2 21:25 exception_hierarchy.pdf", "-rw-r--r-- 1 jttoivon hyad-all 98765 Nov 2 21:25 exception_hierarchy.pdf"]

# I create a new list to store the results in
new_list = []

# Loop over every line in the list:
for x in my_list:
    # Capture everything from the 5-digit size up to the "pdf" extension
    y = re.findall(r"([0-9]{5}.+pdf)", x)
    for thing in y:
        # split() with no argument also copes with runs of spaces such as "Nov  2"
        a, b, c, d, e = thing.split()
        g, h = d.split(":")  # split "21:25" into hour and minute
        z = (a, b, c, g, h, e)
        new_list.append(z)

print(new_list)
Andreas G.
  • Thank you for the response. It was helpful but the output formatting is changed particularly for time even though you have given (:) split. – SNEHA Mar 19 '20 at 13:46
  • welcome. Not sure though what you mean with the time as the output corresponds to the output you wanted. Anyhow, I am sure you can adjust as you see fit. – Andreas G. Mar 19 '20 at 14:55

Here's a working example using regular expressions with the re module:

>>> import re
>>> line = "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf"
>>> pattern = r"(\d+)\s+([A-Za-z]+)\s+(\d{1,2})\s+(\d{1,2}):(\d{1,2})\s+(.+)$"
>>> output_tuple = re.findall(pattern, line)[0]
>>> print(output_tuple)
('25399', 'Nov', '2', '21', '25', 'exception_hierarchy.pdf')
>>> size, month, day, hour, minute, filename = output_tuple

Most of the logic lives in the raw-string pattern. It is easy to follow if you read it piece by piece; here it is split across lines:

(\d+)        # one or more digits (size)
\s+          # one or more whitespace characters
([A-Za-z]+)  # one or more letters (month)
\s+          # one or more whitespace characters
(\d{1,2})    # one or two digits (day)
\s+          # one or more whitespace characters
(\d{1,2})    # one or two digits (hour)
:            # a literal ':'
(\d{1,2})    # one or two digits (minute)
\s+          # one or more whitespace characters
(.+)         # one or more of any character (filename)
$            # until the string ends
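
The same breakdown can live inside the code itself with the re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments. This is only a sketch of that idea, using the line from the question; the names LISTING_PATTERN and fields are illustrative and not part of the original answer.

```python
import re

# Illustrative name; the pattern follows the piece-by-piece breakdown above.
LISTING_PATTERN = re.compile(
    r"""
    (\d+)        # size: one or more digits
    \s+          # one or more whitespace characters
    ([A-Za-z]+)  # month: one or more letters
    \s+
    (\d{1,2})    # day: one or two digits
    \s+
    (\d{1,2})    # hour: one or two digits
    :            # literal colon between hour and minute
    (\d{1,2})    # minute: one or two digits
    \s+
    (.+)         # filename: the rest of the line
    $
    """,
    re.VERBOSE,
)

line = "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf"
match = LISTING_PATTERN.search(line)
fields = match.groups() if match else None
print(fields)
```

Because whitespace in the pattern is ignored in this mode, spaces in the input still have to be matched explicitly with \s+.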

By the way, here's a working example that doesn't use re (it isn't strictly needed to parse a line in this format):

>>> line = "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf"
>>> size, month, day, hour_minute, filename = line.split("hyad-all")[1].strip().split()
>>> hour, minute = hour_minute.split(":")
>>> print(size, month, day, hour, minute, filename)
25399 Nov 2 21 25 exception_hierarchy.pdf
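
To get the list of tuples the question asks for, the regular expression can be applied to each line in turn. A minimal sketch under a few assumptions: the helper name parse_listing is hypothetical, converting the numeric fields to int is my choice rather than something stated in the question, and the second sample line is made up for illustration.

```python
import re

# Hypothetical helper; the name and the int conversion are assumptions.
PATTERN = re.compile(
    r"(\d+)\s+([A-Za-z]+)\s+(\d{1,2})\s+(\d{1,2}):(\d{1,2})\s+(.+)$"
)


def parse_listing(lines):
    """Return one (size, month, day, hour, minute, filename) tuple per matching line."""
    result = []
    for line in lines:
        m = PATTERN.search(line.rstrip("\n"))
        if m is None:
            continue  # skip lines that do not look like a file entry
        size, month, day, hour, minute, filename = m.groups()
        result.append((int(size), month, int(day), int(hour), int(minute), filename))
    return result


sample = [
    "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf\n",
    # Made-up second line, for illustration only
    "-rw-r--r-- 1 jttoivon hyad-all    2225 Mar 10 11:50 add_colab_link.py\n",
]
print(parse_listing(sample))
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines. If the fields should stay strings, drop the int() calls.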
arnaud
  • It is very well explained by you..amazing..thanks a ton. However, I have a question-How come using '\d{1,2}' does not pick digit 1 in "-rw-r--r-- 1 jttoivon hyad-all" in the output? Just trying to understand how reg ex work. Do they work in a particular sequence/order? – SNEHA Mar 19 '20 at 13:21
  • You're welcome. If you were to look with `pattern = r"\d{1,2}"` you would indeed end up also retrieving the digit `1` in `-rw-r--r-- 1`. But in my given pattern, you're not looking only for a 1-2 digit pattern, you're looking for a sequence of patterns in a given order indeed. Regular expressions take into account the order given inside the pattern. You may have fun testing by yourself using tools like https://pythex.org/ (which is how I came up with the pattern by the way). If that answer suits your needs, don't hesitate accepting it please! – arnaud Mar 19 '20 at 14:15
  • See https://pythex.org/?regex=%5Cd%7B1%2C2%7D&test_string=-rw-r--r--%201%20jttoivon%20hyad-all%20%20%2025399%20Nov%20%202%2021%3A25%20exception_hierarchy.pdf&ignorecase=0&multiline=0&dotall=0&verbose=0 – arnaud Mar 19 '20 at 14:16
  • @SNEHA did that work out for you? Please accept the answer if it did. Best, – arnaud Mar 31 '20 at 13:41
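
The point about ordering in the comments above can be checked directly. The sketch below compares the bare \d{1,2} pattern with the full pattern on the line from the question; the variable names loose and full are illustrative.

```python
import re

line = "-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf"

# On its own, \d{1,2} matches every run of one or two digits,
# including the lone "1" near the start of the line.
loose = re.findall(r"\d{1,2}", line)
print(loose)  # ['1', '25', '39', '9', '2', '21', '25']

# In the full pattern, the first group must be followed by letters and then
# by a 1-2 digit day. Starting at "1", the letters "jttoivon" match, but the
# next token "hyad-all" is not a number, so that attempt fails and the
# engine moves on until it reaches "25399".
pattern = r"(\d+)\s+([A-Za-z]+)\s+(\d{1,2})\s+(\d{1,2}):(\d{1,2})\s+(.+)$"
full = re.search(pattern, line)
print(full.group(1))  # 25399
```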