Ipynb To Markdown



A README is typically the front page of a project and should contain relevant information for current & prospective users. To keep documentation consistent across a project, imagine if we could include this README, the front page of our project, both on the repository and in the documentation. This post goes into how to set this workflow up. Find a live example of it implemented at: https://github.com/JackMcKew/pandas_alive.

A good starting structure for a project's README is:

  1. Intro - A short description & output (if applicable) of the project.

  2. Usage - A section on how the project is to be used (if applicable).

  3. Documentation - Link to documentation for the project.

  4. Contributing Guidelines - If this is an open source project, a note on whether contributions are welcome & instructions on how to get involved are well received.

  5. Changelog - Keeping a changelog of what is changing as the project evolves.

Other useful sections when applicable are requirements, future plans and inspiration.


Inspiration for This Post

  • One of my biggest mistakes with this blog was not finding a WordPress plugin that would allow me to write my posts with markdown; to this day I still need to write posts in “Visual” mode.

The inspiration for this post also comes from Pandas_Alive, wherein there are working examples with output hosted on the README. Initially, these lived in a generate_examples.py file, and as the package evolved, the code was copied over into code blocks in README.md to match the examples. If you can see where this is going: whenever new examples were made, copying over the code that generated them was often forgotten. This is very frustrating for new users of the package, as the examples simply don't work. Thus the workflow we go into in this post was adopted.

README.ipynb

In projects, it's typically best practice not to repeat yourself in multiple places (the DRY principle). In the README, it's nice to have working examples of how a user may use the project. If we could tie the original README to live code that generates the examples, that would be ideal; enter README.ipynb.

Jupyter supports markdown & code cells, thus all the current documentation in README.md can be copied into markdown cells. Similarly, the code used to generate examples or demonstrate usage can be placed in code cells. This allows the author to run the entire notebook, generating the new examples & verifying the examples are working code. Fantastic, this is exactly where we want to go.

Now if you only have the README.ipynb in the repository, GitHub will present the file in its raw form, JSON; that is, hundreds of lines like:
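For illustration, a minimal notebook's raw JSON looks roughly like this (a trimmed, illustrative sketch, not the actual Pandas_Alive README):

```json
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": ["# My Project\n", "A short description..."]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": ["print('hello')"]
    }
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 4
}
```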

This is not ideal whatsoever; it is nowhere near as attractive as the nicely rendered README.md. Enter nbconvert.

README.ipynb -> README.md with nbconvert

nbconvert is a package built to convert Jupyter notebooks to other formats and can be installed similarly to jupyter (e.g., pip install jupyter, pip install nbconvert). See the documentation at: https://nbconvert.readthedocs.io/en/latest/.

Now let's check the supported output types for nbconvert:

  • HTML,
  • LaTeX,
  • PDF,
  • Reveal.js HTML slideshow,
  • Markdown,
  • AsciiDoc,
  • reStructuredText,
  • executable script,
  • notebook.

nbconvert supports Markdown! Fantastic, we can add this step into our CI process (e.g., a GitHub Action). This will allow us to generate a new README.md whenever our README.ipynb changes.

In Pandas_Alive, we clear the output of the cells in README.ipynb with the flags: jupyter nbconvert --ClearMetadataPreprocessor.enabled=True --ClearOutput.enabled=True --to markdown README.ipynb.
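As a sketch, a GitHub Actions workflow running this conversion might look like the following (job names, action versions and the commit step are illustrative, not Pandas_Alive's actual workflow):

```yaml
name: convert-readme

on:
  push:
    paths:
      - README.ipynb

jobs:
  nbconvert:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install jupyter nbconvert
      - run: >
          jupyter nbconvert
          --ClearMetadataPreprocessor.enabled=True
          --ClearOutput.enabled=True
          --to markdown README.ipynb
      # Commit the regenerated README.md back to the repository
      - run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add README.md
          git commit -m "Regenerate README.md" || echo "No changes"
          git push
```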

Python Highlighting in Output

When first run, we noticed that nbconvert wasn't marking the code blocks with the language (python), which is required for language-specific highlighting of the code blocks in the README.md. The workaround for this was to use nbconvert's support for custom templates. See the docs at: https://nbconvert.readthedocs.io/en/latest/customizing.html#Custom-Templates.

The resulting template 'pythoncodeblocks.tpl' was:
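The template body did not survive in this copy of the post; a reconstruction, based on nbconvert's legacy Jinja .tpl mechanism (extend the built-in markdown template and override the input block so code cells get a python-tagged fence), would be:

````jinja
{% extends 'markdown.tpl' %}

{% block input %}
```python
{{ cell.source }}
```
{% endblock input %}
````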

Which could be used with nbconvert with:
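That is, pointing nbconvert at the custom template (a sketch using the legacy --template flag):

```shell
jupyter nbconvert --to markdown --template=pythoncodeblocks.tpl README.ipynb
```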

Integration into Documentation with Sphinx


If you haven't already, check out my previous post Automatically Generate Documentation with Sphinx. That post goes into detail on how to implement Sphinx to generate all of the documentation for a project from docstrings automatically.


Before going on, the live site of the documentation in reference can be reached at: https://jackmckew.github.io/pandas_alive/

Now, we've:

  1. Stored the working code & documentation for our project's front page in a Jupyter notebook, README.ipynb
  2. Converted README.ipynb into markdown format with nbconvert
  3. Inserted language-specific highlighting (python) into the code blocks within the markdown

The next step is to make the README content also live in the documentation.

Since Sphinx relies on the reStructuredText format, we'll need to convert README.md to README.rst. Enter m2r, a markdown to reStructuredText converter.

nbconvert could be used for this step instead of m2r; that said, this step was originally developed before README.ipynb existed, when only README.md was available. Please drop a comment if you try nbconvert instead of m2r for this step, along with your results!

Firstly, m2r can be installed with pip (pip install m2r) and we can convert README.md with the command m2r README.md which will generate README.rst in the same directory.

Now we need to include our README.rst in the documentation. After much tweaking, the documentation structure settled on for Pandas_Alive, using autosummary to generate documentation from docstrings automatically, was:
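As a sketch, the index.rst pulling the converted README into the toctree might look like this (the heading and any file names other than README.rst and developer.rst are illustrative):

```rst
Pandas_Alive documentation
==========================

.. toctree::
   :maxdepth: 2

   README
   developer
```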

Autosummary-generated documentation is included within a separate rst file (developer.rst) so that everything generated with autosummary nests under one heading with the ReadTheDocs theme.

Integration with GitHub Actions

All the steps above mentioned are currently being used to maintain the project Pandas_Alive.

Find the GitHub Action yml files at: https://github.com/JackMcKew/pandas_alive/tree/master/.github/workflows

Find the Sphinx configuration files at: https://github.com/JackMcKew/pandas_alive/tree/master/docs

Having migrated my site from Hugo Academic to {{distill}}, I wanted to see if I could fold in python-rendered notebooks with ease, and what the results would look like. Suffice it to say the process has been painless so far …

  1. create the usual Rmd for the post,
  2. export the ipynb notebook from jupyter as a markdown file (.md),
  3. copy-and-paste the md file’s content into the post’s Rmd file, and
  4. add some css.

I had some pretty long and wide tables in the notebook that did not fit within the page-width and so I had to use some css to create a scrollable div ([@braican's solution found here](https://stackoverflow.com/a/17451132)) and voila!
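The css in question can be as small as a scrollable wrapper around wide tables, something like (selector name illustrative):

```css
/* Let wide tables scroll horizontally instead of overflowing the page */
.scrollable-table {
  display: block;
  overflow-x: auto;
  width: 100%;
}
```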

We will start with importing some libraries we need and then play with some data to understand basic python commands. What data shall we work with? Well, let us pull down some data on criminal incidents that were reported.

First we install a particular library called pandas. In the command that follows, note that pd is just the alias that pandas assumes, so that we can type pd and have all the pandas commands at our disposal.

The crime incident reports data are available here and span multiple years so we may end up working only with 2019 data but for now we proceed by gathering everything.

In the command below, the key part is pd.read_csv(), and inside it is the URL for the comma-separated values file. Once the file is downloaded by pandas, we save it in python under the name df.
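The step above can be sketched as follows (the data portal's URL is omitted here; a tiny inline CSV stands in for the real download so the sketch is self-contained):

```python
import io

import pandas as pd

# The real post downloads the CSV straight from the data portal:
# df = pd.read_csv("<crime incident reports CSV URL>")
# A tiny inline CSV keeps this sketch self-contained.
csv_text = (
    "INCIDENT_NUMBER,OFFENSE_CODE,YEAR,MONTH\n"
    "I0001,3301,2019,10\n"
    "I0002,3115,2020,7\n"
)
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 4): two rows, four columns
```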

Note that the first five rows of the dataframe look like this:

[Output: a 5 × 17 table with columns INCIDENT_NUMBER, OFFENSE_CODE, OFFENSE_CODE_GROUP, OFFENSE_DESCRIPTION, DISTRICT, REPORTING_AREA, SHOOTING, OCCURRED_ON_DATE, YEAR, MONTH, DAY_OF_WEEK, HOUR, UCR_PART, STREET, Lat, Long and Location. The first rows include offenses such as ASSAULT - AGGRAVATED, VERBAL DISPUTE, THREATS TO DO BODILY HARM and INVESTIGATE PERSON, dated 2019–2020.]

What about the last 10 rows of the data?

[Output: the last ten rows of the dataframe (row labels 515072–515081), with the same columns as above. Offenses include INVESTIGATE PERSON, THREATS TO DO BODILY HARM, SICK/INJURED/MEDICAL - POLICE, BIOLOGICAL THREATS, INVESTIGATE PROPERTY, MISSING PERSON, WEAPON VIOLATION - CARRY/ POSSESSING/ SALE and BURGLARY - COMMERICAL, dated between November 2019 and September 2020.]

Let us look at the contents of the numeric columns with describe():

|       | OFFENSE_CODE  | YEAR          | MONTH         | HOUR          | Lat           | Long          |
| ----- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| count | 515082.000000 | 515082.000000 | 515082.000000 | 515082.000000 | 485909.000000 | 485909.000000 |
| mean  | 2333.275632   | 2017.542933   | 6.634194      | 13.079170     | 42.239043     | -70.949353    |
| std   | 1182.489822   | 1.543329      | 3.317964      | 6.347259      | 1.891645      | 3.060012      |
| min   | 111.000000    | 2015.000000   | 1.000000      | 0.000000      | -1.000000     | -71.203312    |
| 25%   | 1102.000000   | 2016.000000   | 4.000000      | 9.000000      | 42.296861     | -71.097465    |
| 50%   | 3005.000000   | 2018.000000   | 7.000000      | 14.000000     | 42.325029     | -71.077723    |
| 75%   | 3201.000000   | 2019.000000   | 9.000000      | 18.000000     | 42.348312     | -71.062562    |
| max   | 3831.000000   | 2020.000000   | 12.000000     | 23.000000     | 42.395042     | 0.000000      |

By default the command will report the values with decimals but we may not want that. Decimals can be rounded or removed altogether as shown below.

Rounded to two decimal places:

|       | OFFENSE_CODE | YEAR      | MONTH     | HOUR      | Lat       | Long      |
| ----- | ------------ | --------- | --------- | --------- | --------- | --------- |
| count | 515082.00    | 515082.00 | 515082.00 | 515082.00 | 485909.00 | 485909.00 |
| mean  | 2333.28      | 2017.54   | 6.63      | 13.08     | 42.24     | -70.95    |
| std   | 1182.49      | 1.54      | 3.32      | 6.35      | 1.89      | 3.06      |
| min   | 111.00       | 2015.00   | 1.00      | 0.00      | -1.00     | -71.20    |
| 25%   | 1102.00      | 2016.00   | 4.00      | 9.00      | 42.30     | -71.10    |
| 50%   | 3005.00      | 2018.00   | 7.00      | 14.00     | 42.33     | -71.08    |
| 75%   | 3201.00      | 2019.00   | 9.00      | 18.00     | 42.35     | -71.06    |
| max   | 3831.00      | 2020.00   | 12.00     | 23.00     | 42.40     | 0.00      |

Rounded to whole numbers:

|       | OFFENSE_CODE | YEAR     | MONTH    | HOUR     | Lat      | Long     |
| ----- | ------------ | -------- | -------- | -------- | -------- | -------- |
| count | 515082.0     | 515082.0 | 515082.0 | 515082.0 | 485909.0 | 485909.0 |
| mean  | 2333.0       | 2018.0   | 7.0      | 13.0     | 42.0     | -71.0    |
| std   | 1182.0       | 2.0      | 3.0      | 6.0      | 2.0      | 3.0      |
| min   | 111.0        | 2015.0   | 1.0      | 0.0      | -1.0     | -71.0    |
| 25%   | 1102.0       | 2016.0   | 4.0      | 9.0      | 42.0     | -71.0    |
| 50%   | 3005.0       | 2018.0   | 7.0      | 14.0     | 42.0     | -71.0    |
| 75%   | 3201.0       | 2019.0   | 9.0      | 18.0     | 42.0     | -71.0    |
| max   | 3831.0       | 2020.0   | 12.0     | 23.0     | 42.0     | 0.0      |
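The rounding shown above can be sketched like so (a toy HOUR column stands in for the full dataset):

```python
import pandas as pd

# describe() reports count/mean/std/min/quartiles/max for numeric columns.
df = pd.DataFrame({"HOUR": [0, 9, 14, 18, 23]})
summary = df.describe()

rounded_two = summary.round(2)   # two decimal places
rounded_zero = summary.round(0)  # whole numbers only
print(rounded_two.loc["mean", "HOUR"])   # 12.8
print(rounded_zero.loc["mean", "HOUR"])  # 13.0
```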

Note a few things here.

  • We have a total of 515082 incidents logged, but latitude and longitude are available for no more than 485909 incidents.

Say we want to restrict the dataframe just to 2020. How can we do that?

Notice the sequence here: dataframe[ dataframe['column-name'] == somevalue ], and pay attention to the double equals sign, which is a strict equality test rather than an assignment.

|       | OFFENSE_CODE | YEAR    | MONTH        | HOUR         | Lat          | Long         |
| ----- | ------------ | ------- | ------------ | ------------ | ------------ | ------------ |
| count | 63733.000000 | 63733.0 | 63733.000000 | 63733.000000 | 62200.000000 | 62200.000000 |
| mean  | 2353.137323  | 2020.0  | 4.900554     | 12.923525    | 42.319872    | -71.084193   |
| std   | 1182.670996  | 0.0     | 2.561463     | 6.566899     | 0.032339     | 0.030578     |
| min   | 111.000000   | 2020.0  | 1.000000     | 0.000000     | 42.181845    | -71.203312   |
| 25%   | 1001.000000  | 2020.0  | 3.000000     | 9.000000     | 42.295353    | -71.098579   |
| 50%   | 3005.000000  | 2020.0  | 5.000000     | 14.000000    | 42.321918    | -71.078444   |
| 75%   | 3207.000000  | 2020.0  | 7.000000     | 18.000000    | 42.344561    | -71.062000   |
| max   | 3831.000000  | 2020.0  | 9.000000     | 23.000000    | 42.395041    | -70.953726   |
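The 2020 restriction can be sketched as follows (toy data; column names as in the post):

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR": [2019, 2020, 2020],
    "OFFENSE_CODE": [3301, 3115, 801],
})
# Boolean mask: == tests equality; a single = would be an assignment.
df_2020 = df[df["YEAR"] == 2020]
print(len(df_2020))  # 2
```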

At this point we might be curious to know what types of offenses are most often reported. Before we do that, however, let us also see how many unique values of OFFENSE_CODE there are.

So code 3301 leads with 6234 reports in 2020, followed by code 3115, then 801, then 3005, and then 3831. Code 3005 is missing from their list so we have no idea what it is!! That is a crime in itself.

|     | OFFENSE_CODE | count |
| --- | ------------ | ----- |
| 109 | 3301         | 6234  |
| 95  | 3115         | 5494  |
| 23  | 801          | 3908  |
| 82  | 3005         | 3227  |
| 129 | 3831         | 2700  |
| …   | …            | …     |
| 86  | 3016         | 2     |
| 26  | 990          | 1     |
| 106 | 3203         | 1     |
| 76  | 2672         | 1     |
| 65  | 2628         | 1     |


130 rows × 2 columns
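The counting above can be sketched with nunique() and value_counts() (toy data; value_counts sorts by frequency, most common first):

```python
import pandas as pd

df = pd.DataFrame({"OFFENSE_CODE": [3301, 3301, 3115, 3301, 801]})
print(df["OFFENSE_CODE"].nunique())  # 3 distinct codes

counts = df["OFFENSE_CODE"].value_counts()
print(counts.index[0], counts.iloc[0])  # 3301 3 (the most frequent code)
```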


Not bad. I suppose that by forcing fixed headers and some other aesthetic tuning of the table rendering, this could be a pretty efficient solution for adding python notebooks. I am really coming to love distill!!