I am a complete beginner in Python. I've imported a CSV file into Python as a data frame of 1618 rows x 1 column. Essentially, I want to keep 2 recurring types of rows throughout the data frame. I would like to do this by deleting all rows that do not meet either of the following conditions:

1) Starts with a space followed by 9 digits (Ex: " 123456789")

2) Contains any of the following years ("2000", "2001", ..., "2020")

So basically, I would be left with two types of rows, however many times they appear in the data frame:

1) Rows that start with a space followed by 9 digits

2) Rows that contain any year from "2000" up to "2020"

Any help writing this would be amazing and greatly appreciated. I am looking to learn more and be able to do all of this independently.

UPDATE: Hey thank you all for the help... I will provide some lines that print from the CSV for clarification:

11 XXXXXX ...

12 NAME: ABC

13 ----------------------------------------------...

14 XXX...

15 123456789 - - .0000 ...

16 -------------------------------------...

17 G52 0000000000000000000000...

18 G53 XXX 09132017 ...

NOTE: Please disregard the strange lines with X's and dashes... the data comes from another program. Line 18 contains the date which would be found by the year "2017", and line 15 contains the beginning space and 9 digits. If any more information would help, feel free to let me know. Thank you!

nick_zam
  • Please provide a small set of sample data as text that we can copy and paste. Include the corresponding desired result. Check out the guide on [how to make good reproducible pandas examples](https://stackoverflow.com/a/20159305/3620003). – timgeb Jun 06 '20 at 17:07
  • Thank you for the help! Just updated the post, I'll take a look at the link. – nick_zam Jun 07 '20 at 20:49

2 Answers

0

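This is a two-condition filter using str.match and str.contains. Combine the two masks with | and index with the result to keep the rows that satisfy either condition:

con1 = df['col1'].str.match(r'\s*\d{9}')            # optional whitespace, then 9 digits at the start of the row
con2 = df['col1'].str.contains(r'20(?:[01]\d|20)')  # any year from 2000 through 2020, anywhere in the row
yourdf = df[con1 | con2]                            # keep rows that match either condition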
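
As a minimal sketch of this kind of two-mask filter, here is a self-contained run on rows modelled on the sample in the question. The column name col1 and the exact row contents are assumptions for illustration, not the asker's real data. The combined mask is not negated, because the goal is to keep the matching rows.

```python
import pandas as pd

# Hypothetical sample modelled on the rows shown in the question;
# the column name "col1" is an assumption.
df = pd.DataFrame({"col1": [
    "XXXXXX",
    "NAME: ABC",
    "--------------------",
    " 123456789 - - .0000",
    "G52 0000000000000000000000",
    "G53 XXX 09132017",
]})

# Rows that begin with optional whitespace followed by nine digits.
starts_with_id = df["col1"].str.match(r"\s*\d{9}")

# Rows that contain a year from 2000 through 2020 anywhere in the text.
has_year = df["col1"].str.contains(r"20(?:[01]\d|20)")

# Keep a row when either condition holds.
kept = df[starts_with_id | has_year]
print(kept)
```

On this sample, only the row with the nine-digit number and the row whose date ends in 2017 remain. If the real column contains missing values, passing na=False to both string methods is one way to treat them as non-matches.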
BENY
  • Hey, thank you! I will give it a go as soon as I get a chance. I gave an update to my initial post so maybe you'd want to take a look. Appreciate it! – nick_zam Jun 07 '20 at 20:48
0

Try:

df = df.loc[df["x"].str.match(r"\s*(\d{9}|.*20([01]\d|20))")]

Here x is your input column and df is your DataFrame.
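
Two details are worth checking when a single pattern like this is used with str.match. The method anchors only at the start of each string, so a trailing $ changes which rows pass, and a character class such as [0-2]\d accepts years beyond 2020. The sketch below illustrates both on made-up strings (the sample values are assumptions for illustration).

```python
import pandas as pd

rows = pd.Series([
    " 123456789 - - .0000",  # nine digits followed by more text
    " 123456789",            # nine digits and nothing else
    "G53 XXX 09132017",      # no leading nine-digit number
])

# str.match anchors at the start only, so trailing text is allowed.
prefix_ok = rows.str.match(r"\s*\d{9}")

# A trailing `$` requires the row to end right after the nine digits.
whole_row = rows.str.match(r"\s*\d{9}$")

# `20[0-2]\d` also accepts 2021 through 2029; the alternation below
# stops at 2020.
late_year = pd.Series(["report 2025"])
loose = late_year.str.contains(r"20[0-2]\d")
strict = late_year.str.contains(r"20(?:[01]\d|20)")

print(prefix_ok.tolist(), whole_row.tolist(), loose.tolist(), strict.tolist())
```

Whether the stricter year range matters depends on what else can appear in the data; long digit runs such as dates or amounts can contain a four-digit substring that looks like a year.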

Grzegorz Skibinski