I am new to text mining. I have a CSV file, and I need to go through each line, extract some information, and write it to another CSV file. The information I am looking for is defined by a dictionary of keywords. Consider the sentence below:

"the application version is 1.8.2 and the variable skt.len passes the required information. file ReadMe.txt has the specifications."

My dictionary is: ["application version", "variable", "file"]

I need to extract:

  • application version: 1.8.2
  • variable: skt.len
  • file: ReadMe.txt

What is the best way to extract such information from text? I have been experimenting with NLTK and Stanford CoreNLP, but I have not managed to extract the information yet. I am thinking of using regex to extract the application version. Any ideas?

PS: I know this makes the task more complicated, but the sentences in each line of the CSV file may have different structures. For example, "application version" in one line may be "app version" in another, and "file" in one line may be "filename" in another.

Mahhos
  • Please show us some example input and the desired output. – Klaus D. Sep 02 '18 at 05:37
  • Welcome to Stack Overflow! Please take the [tour] and read through the [help], in particular [how-to-ask]. Your best bet here is to do your research, search for related topics on SO, and give it a go. If you get stuck and can't get unstuck after more research and searching, post a [mcve] of your attempt and say specifically where you're stuck. You could use pandas; there are plenty of other Python packages that can read and write CSV. A previous question on pandas and NLTK: https://stackoverflow.com/questions/34784004/python-text-processing-nltk-and-pandas – lxx Sep 02 '18 at 05:44
  • For example: JavaScript in Internet Explorer 3.x and 4.x, and Netscape 2.x, 3.x and 4.x. – Mahhos Sep 02 '18 at 15:59
  • The input is text strings stored in a CSV file, for example "JavaScript in Internet Explorer 3.x and 4.x", and we need to extract information such as the application version and any file name or variable name that appears in the text. The keywords ["application version", "variable", "file"] can be mentioned explicitly or implicitly in the input. In this example, 3.x and 4.x are the outputs for application version. The program should be able to detect and extract them. It could work with the parse tree and focus only on the NPs (noun phrases) to extract this information. – Mahhos Sep 02 '18 at 16:07
  • You can use a regular expression to get the desired value after a specific text match, e.g. Version or Variable. – Sourabh Sep 04 '18 at 11:20

1 Answer

I use R, and below is one way (not the best one, but enough to show how it works) to extract the value of variable:

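>> library(stringr)  # str_extract comes from the stringr package
>> str_extract(text, '(?<=variable\\s)\\w+(\\.\\w+)?')  # dot escaped so it matches only a literal "."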

Here, text is the entire string you shared. This gives the output:

>> skt.len

I am sure there are similar functions in Python that can do this and return the output in the desired format.
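A rough Python counterpart, as a minimal sketch using only the standard `re` module. The alias table, the `build_pattern` and `extract` helpers, and the optional "is" between keyword and value are assumptions made for illustration; they come from the question's example sentence, not from any library. A capture group is used instead of a lookbehind because Python's `re` only supports fixed-width lookbehinds, and the aliases differ in length.

```python
import re

# Hypothetical alias table: canonical field -> surface forms that may appear.
ALIASES = {
    "application version": ["application version", "app version"],
    "variable": ["variable"],
    "file": ["filename", "file"],
}

# A value is a run of word characters optionally joined by dots or hyphens,
# e.g. "1.8.2", "skt.len", "ReadMe.txt".
VALUE = r"(\w+(?:[.\-]\w+)*)"


def build_pattern(aliases):
    # Longest alias first so that "filename" is tried before "file".
    ordered = sorted(aliases, key=len, reverse=True)
    alternatives = "|".join(re.escape(a) for a in ordered)
    # Allow an optional "is" between the keyword and its value.
    return re.compile(rf"\b(?:{alternatives})\b(?:\s+is)?\s+{VALUE}", re.IGNORECASE)


PATTERNS = {field: build_pattern(names) for field, names in ALIASES.items()}


def extract(text):
    """Return {field: first matched value} for every field found in text."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1)
    return found


sentence = ("the application version is 1.8.2 and the variable skt.len passes "
            "the required information. file ReadMe.txt has the specifications.")
print(extract(sentence))
# {'application version': '1.8.2', 'variable': 'skt.len', 'file': 'ReadMe.txt'}
```

This only captures the token that directly follows a keyword, so a sentence such as "the file has been removed" would wrongly yield "has". It is a starting point for the explicit cases, not a solution for implicit mentions.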

Sourabh
  • Thanks, but the task is more complex than regular expressions can handle. Since the data is unstructured, we do not know how each description is phrased. The program should understand that, for example, in the description below, myBloggie is the application name and 2.1.6 is its version. I can handle the version numbers with regex, but that does not work for the other information I need to extract. EXAMPLE: "Multiple SQL injection vulnerabilities in myBloggie 2.1.6 and earlier allow remote attackers to execute arbitrary SQL commands via the (1) cat_id or (2) year parameter to index.php in a viewuser action". – Mahhos Sep 04 '18 at 19:18
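
For the version-number part that the comment above says regex can handle, a minimal Python sketch might look like the following. It pairs each dotted version number with the single word immediately before it. The pattern, the stop-word set, and the `name_version_pairs` helper are hypothetical and purely heuristic; they do not recognise application names in any real sense, which would need something closer to the parse-tree or named-entity approach discussed in the question.

```python
import re

# A word token directly followed by a dotted version such as 2.1.6 or 3.x.
NAME_VERSION = re.compile(r"\b([A-Za-z][\w\-]*)\s+(\d+(?:\.(?:\d+|x))+)\b")

# Hypothetical stop words that often precede a version without being a name.
STOP_WORDS = {"and", "or", "to", "through", "before"}


def name_version_pairs(text):
    """Return (preceding word, version) pairs, skipping stop words."""
    return [
        (name, version)
        for name, version in NAME_VERSION.findall(text)
        if name.lower() not in STOP_WORDS
    ]


description = ("Multiple SQL injection vulnerabilities in myBloggie 2.1.6 and "
               "earlier allow remote attackers to execute arbitrary SQL commands "
               "via the (1) cat_id or (2) year parameter to index.php in a "
               "viewuser action")
print(name_version_pairs(description))
# [('myBloggie', '2.1.6')]
```

On "JavaScript in Internet Explorer 3.x and 4.x, and Netscape 2.x, 3.x and 4.x" this returns only ('Explorer', '3.x') and ('Netscape', '2.x'). Multi-word names and version lists are not handled, which illustrates why regex alone falls short here.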